Abstract


We present a new framework based on walks in a graph for analysis and inference in Gaussian 
graphical models. The key idea is to decompose the correlation between each pair of variables as 
a sum over all walks between those variables in the graph. The weight of each walk is given by a 
product of edgewise partial correlation coefficients. This representation holds for a large class of 
Gaussian graphical models which we call walk-summable. We give a precise characterization of 
this class of models, and relate it to other classes including diagonally dominant, attractive, non- 
frustrated, and pairwise-normalizable. We provide a walk-sum interpretation of Gaussian belief 
propagation in trees and of the approximate method of loopy belief propagation in graphs with 
cycles. The walk-sum perspective leads to a better understanding of Gaussian belief propagation 
and to stronger results for its convergence in loopy graphs. 


Keywords: Gaussian graphical models, walk-sum analysis, convergence of loopy belief propaga- 
tion 


1. Introduction 


We consider multivariate Gaussian distributions defined on undirected graphs, which are often re- 
ferred to as Gauss-Markov random fields (GMRFs). The nodes of the graph denote random variables 
and the edges capture the statistical dependency structure of the model. The family of all Gauss- 
Markov models defined on a graph is naturally represented in the information form of the Gaussian 
density. The key parameter of the information form is the information matrix, which is the inverse 
of the covariance matrix. The information matrix is sparse, reflecting the structure of the defining 
graph such that only the diagonal elements and those off-diagonal elements corresponding to edges 
of the graph are non-zero. 

Given such a model, we consider the problem of computing the mean and variance of each 
variable, thereby determining the marginal densities as well as the mode. In principle, these can be 
obtained by inverting the information matrix, but the complexity of this computation is cubic in the 
number of variables. More efficient recursive calculations are possible in graphs with very sparse
structure, for example, in chains, trees and in graphs with “thin” junction trees. For these models,
belief propagation (BP) or its junction tree variants efficiently compute the marginals (Pearl, 1988; 
Cowell et al., 1999). In large-scale models with more complex graphs, for example, for models 
arising in oceanography, 3D-tomography, and seismology, even the junction tree approach becomes 
computationally prohibitive. Iterative methods from numerical linear algebra (Varga, 2000) can be 
used to compute the marginal means. However, in order to efficiently compute both means and 
variances, approximate methods such as loopy belief propagation (LBP) are needed (Pearl, 1988; 
Yedidia, Freeman, and Weiss, 2003; Weiss and Freeman, 2001; Rusmevichientong and Van Roy, 
2001). Another important motivation for using LBP, emphasized for example by Moallemi and Van 
Roy (2006a), is its distributed nature which is important for applications such as sensor networks. 
While LBP has been shown to often provide good approximate solutions for many problems, it is 
not guaranteed to do so in general, and may even fail to converge. 

In prior work, Rusmevichientong and Van Roy (2001) analyzed Gaussian LBP on the turbo- 
decoding graph. For this special case they established that variances converge, means follow a 
linear system upon convergence of the variances, and that if means converge then they are correct. 
Weiss and Freeman (2001) analyzed LBP from the computation tree perspective to give a sufficient 
condition (equivalent to diagonal dominance of the information matrix) for convergence, and also 
showed correctness of the means upon convergence. Wainwright et al. (2003) introduced the tree 
reparameterization view of belief propagation and, in the Gaussian case, also showed correctness 
of the means upon convergence. Convergence of other forms of LBP is analyzed by Ihler et al.
(2005), and Mooij and Kappen (2005), but unfortunately their sufficient conditions are not directly 
applicable to the Gaussian case. 

We develop a “walk-sum” formulation for computation of means, variances and correlations as 
sums over certain sets of weighted walks in a graph (see footnotes 1 and 2). This walk-sum formulation applies to a wide
class of Gauss-Markov models which we call walk-summable. We characterize the class of walk- 
summable models and show that it contains (and extends well beyond) some “easy” classes of mod- 
els, including models on trees, attractive, non-frustrated, and diagonally dominant models. We also 
show the equivalence of walk-summability to the fundamental notion of pairwise-normalizability, 
and that inference in walk-summable models can be reduced to inference in an attractive model 
based on a certain extended graph. 

We use the walk-sum formulation to develop a new interpretation of BP in trees and of LBP in 
general. Based on this interpretation we are able to extend the previously known sufficient condi- 
tions for convergence of LBP to the class of walk-summable models. Our sufficient condition is 
stronger than that given by Weiss and Freeman (2001) as the class of diagonally dominant models 
is a strict subset of the class of pairwise-normalizable models. Our results also explain why they did 
not find any examples where LBP does not converge. The reason is that they presumed pairwise- 
normalizability. We also give a new explanation, in terms of walk-sums, of why LBP converges to 
the correct means but not to the correct variances. The reason is that LBP captures all of the walks 
needed to compute the means but only computes a subset of the walks needed for the variances. 





1. After submitting the paper we became aware of a related decomposition for non-Gaussian classical spin systems in 
statistical physics developed by Brydges et al. (1983). Similarly to our work, the decomposition is connected to the 
Neumann series expansion of the matrix inverse, but in addition to products of edge weights, their weight of a walk 
includes a complicated multi-dimensional integral. 

2. Another interesting decomposition of the covariance in Gaussian models in terms of path sums has been proposed 
in Jones and West (2005). It is markedly different from our approach (e.g., unlike paths, walks can cross an edge 
multiple times, and the weight of a path is rather hard to calculate, as opposed to our walk-weights).

In general, walk-summability is not necessary for LBP convergence. Hence, we also provide a
tighter (essentially necessary) condition for convergence of LBP variances based on a weaker form 
of walk-summability defined on the LBP computation tree. This provides deeper insight into why 
LBP can fail to converge—because the LBP computation tree is not always well-posed—which 
suggests connections to Tatikonda and Jordan (2002). 

In related work, concurrent with Johnson et al. (2006), Moallemi and Van Roy (2006a) have 
shown convergence of their consensus propagation algorithm, which uses a pairwise-normalized 
model. In this paper, we demonstrate the equivalence of pairwise-normalizability and walk-summability, 
which suggests a connection between their results and ours. In their more recent work (Moallemi 
and Van Roy, 2006b), concurrent with this paper, they make use of our walk-sum analysis of LBP, 
assuming pairwise-normalizability, to consider other initializations of the algorithm (see footnote 3). However, the
critical condition is still walk-summability, which is presented in this paper. 

In Section 2 we introduce Gaussian graphical models and describe exact BP for tree-structured 
graphs as well as approximate BP for loopy graphs, and their connection to Gaussian elimination. 
Next, in Section 3 we describe our walk-based framework for inference, define walk-summable 
models, and explore the connections between walk-summable models and other subclasses of Gaus- 
sian models. We present the walk-sum interpretation of LBP and our conditions for its convergence 
in Section 4. We discuss non-walksummable models, and tighter conditions for LBP convergence in 
Section 5. Finally, conclusions and directions for further work are discussed in Section 6. Detailed 
proofs omitted from the main body of the paper appear in the appendices. 


2. Preliminaries 


In this section we give a brief background of Gaussian graphical models (Section 2.1) and of Gaus- 
sian elimination and its relation to belief propagation (Section 2.2). 


2.1 Gaussian Graphical Models 


A Gaussian graphical model is defined by an undirected graph $G = (V, E)$, where $V$ is the set of
nodes (or vertices) and $E$ is the set of edges (a set of unordered pairs $\{i, j\} \subset V$), and a collection of
jointly Gaussian random variables $x = (x_i, i \in V)$. The probability density is given by

$$p(x) \propto \exp\{-\tfrac{1}{2} x^T J x + h^T x\} \qquad (1)$$

where $J$ is a symmetric, positive definite matrix ($J \succ 0$) that is sparse so as to respect the graph
$G$: if $\{i, j\} \notin E$ then $J_{ij} = 0$. The condition $J \succ 0$ is necessary so that (1) defines a valid (i.e.,
normalizable) probability density. This is the information form of the Gaussian density. We call
$J$ the information matrix and $h$ the potential vector. They are related to the standard Gaussian
parameterization in terms of the mean $\mu = E\{x\}$ and covariance $P = E\{(x - \mu)(x - \mu)^T\}$ as follows:

$$\mu = J^{-1} h \quad \text{and} \quad P = J^{-1}.$$


This class of densities is precisely the family of non-degenerate Gaussian distributions which are
Markov with respect to the graph $G$ (Speed and Kiiveri, 1986): if a subset of nodes $B \subset V$ separates
two other subsets $A \subset V$ and $C \subset V$ in $G$, then the corresponding subsets of random variables $x_A$
and $x_C$ are conditionally independent given $x_B$. In particular, define the neighborhood of a node $i$ to
be the set of its neighbors: $\mathcal{N}(i) = \{j \mid \{i, j\} \in E\}$. Then, conditioned on $x_{\mathcal{N}(i)}$, the variable $x_i$ is
independent of the rest of the variables in the graph.

3. Here, we choose one particular initialization of LBP. However, fixing this initialization does not restrict the class of
models or applications for which our results apply. For instance, the application considered by Moallemi and Van
Roy (2006a) can also be handled in our framework by a simple reparameterization.

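The partial correlation coefficient between variables $x_i$ and $x_j$ measures their conditional
correlation given the values of the other variables $x_{V \setminus ij} \triangleq (x_k, k \in V \setminus \{i, j\})$. These are computed by
normalizing the off-diagonal entries of the information matrix (Lauritzen, 1996):

$$r_{ij} \triangleq \frac{\mathrm{cov}(x_i, x_j \mid x_{V \setminus ij})}{\sqrt{\mathrm{var}(x_i \mid x_{V \setminus ij})\, \mathrm{var}(x_j \mid x_{V \setminus ij})}} = -\frac{J_{ij}}{\sqrt{J_{ii} J_{jj}}}. \qquad (2)$$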
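As an illustration of (2), the following sketch computes the partial correlations directly from the information matrix and compares them with the conditional correlations obtained from the covariance matrix by a Schur complement. The 4-node chain model and the function names are hypothetical, chosen only for this example.

```python
import numpy as np

def partial_correlations(J):
    """Matrix of r_ij = -J_ij / sqrt(J_ii J_jj), with zero diagonal."""
    d = np.sqrt(np.diag(J))
    R = -J / np.outer(d, d)
    np.fill_diagonal(R, 0.0)
    return R

def conditional_correlation(P, i, j):
    """Correlation of x_i and x_j given all other variables, from the covariance P."""
    n = P.shape[0]
    a = [i, j]
    b = [k for k in range(n) if k not in a]
    # Conditional covariance: P_aa - P_ab P_bb^{-1} P_ba (Schur complement).
    C = P[np.ix_(a, a)] - P[np.ix_(a, b)] @ np.linalg.solve(P[np.ix_(b, b)], P[np.ix_(b, a)])
    return C[0, 1] / np.sqrt(C[0, 0] * C[1, 1])

# Hypothetical chain 0-1-2-3; the values are illustrative only.
J = np.array([[2.0, -0.8, 0.0, 0.0],
              [-0.8, 2.0, 0.5, 0.0],
              [0.0, 0.5, 1.5, -0.3],
              [0.0, 0.0, -0.3, 1.0]])
P = np.linalg.inv(J)
R = partial_correlations(J)
print(R)
```

Pairs that are not joined by an edge have $J_{ij} = 0$ and hence zero partial correlation, which is the conditional independence statement of the Markov property.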





Hence, we observe the relation between the sparsity of J and conditional independence between 
variables. In agreement with the Hammersley-Clifford theorem (Hammersley and Clifford, 1971), 
for Gaussian models we may factor the probability distribution 


$$p(x) \propto \prod_{i \in V} \psi_i(x_i) \prod_{\{i,j\} \in E} \psi_{ij}(x_i, x_j)$$

in terms of node and edge potential functions (see footnote 4):

$$\psi_i(x_i) = \exp\{-\tfrac{1}{2} A_i x_i^2 + h_i x_i\} \quad \text{and} \quad \psi_{ij}(x_i, x_j) = \exp\{-\tfrac{1}{2} [x_i \; x_j]\, B_{ij}\, [x_i \; x_j]^T\}. \qquad (3)$$

Here, $A_i$ and $B_{ij}$ must add up to $J$ such that

$$x^T J x = \sum_i A_i x_i^2 + \sum_{\{i,j\} \in E} [x_i \; x_j]\, B_{ij}\, [x_i \; x_j]^T.$$

The choice of a decomposition of $J$ into such $A_i$ and $B_{ij}$ is not unique: the diagonal elements $J_{ii}$ can
be split in various ways between $A_i$ and $B_{ij}$, but the off-diagonal elements of $J$ are copied directly
into the corresponding $B_{ij}$. It is not always possible to find a decomposition of $J$ such that both
$A_i > 0$ and $B_{ij} \succ 0$ (see footnote 5). We call models where such a decomposition exists pairwise-normalizable.
Our analysis is not limited to pairwise-normalizable models. Instead we use the decomposition
$A_i = J_{ii}$ and $B_{ij} = \begin{pmatrix} 0 & J_{ij} \\ J_{ij} & 0 \end{pmatrix}$, which always exists, and leads to the following node and edge potentials:

$$\psi_i(x_i) = \exp\{-\tfrac{1}{2} J_{ii} x_i^2 + h_i x_i\} \quad \text{and} \quad \psi_{ij}(x_i, x_j) = \exp\{-x_i J_{ij} x_j\}. \qquad (4)$$


Note that any decomposition in (3) can easily be converted to our decomposition (4) using local 
operations (the required elements of J can be read off by adding overlapping matrices). 

We illustrate this framework with a prototypical estimation problem. Suppose that we wish
to estimate an unknown signal $x$ (e.g., an image) based on noisy observations $y$. A commonly
used prior model in image processing is the thin membrane model
$p(x) \propto \exp\{-\tfrac{1}{2}(\alpha \sum_{i \in V} x_i^2 + \beta \sum_{\{i,j\} \in E} (x_i - x_j)^2)\}$
where $\alpha, \beta > 0$ and $E$ specifies nearest neighbors in the image. This model
is described by a sparse information matrix with $J_{ii} = \alpha + \beta |\mathcal{N}(i)|$ and $J_{ij} = -\beta$ for $\{i, j\} \in E$.

4. To be precise, it is actually the negative logarithms of $\psi_i$ and $\psi_{ij}$ that are usually referred to as potentials in the
statistical mechanics literature. We abuse the terminology slightly for convenience.

5. For example, the model with $J = \begin{pmatrix} 1 & 0.6 & 0.6 \\ 0.6 & 1 & 0.6 \\ 0.6 & 0.6 & 1 \end{pmatrix}$ is a valid model with $J \succ 0$, but no decomposition into single and
pairwise positive definite factors exists. This can be verified by posing an appropriate semidefinite feasibility problem,
or as we discuss later through walk-summability.

Now, consider local observations $y$, such that $p(y|x) = \prod_i p(y_i|x_i)$. The distribution of interest is
then $p(x|y) \propto p(y|x)\, p(x)$, which is Markov with respect to the same graph as $p(x)$, but with modified
information parameters. For instance, let $y = x + v$ where $v$ is Gaussian distributed measurement
noise with zero mean and covariance $\sigma^2 I$. Then $p(x|y) \propto \exp\{-\tfrac{1}{2} x^T \hat{J} x + \hat{h}^T x\}$, where
$\hat{J} = J + \frac{1}{\sigma^2} I$ and $\hat{h} = \frac{1}{\sigma^2} y$. Hence, introducing local observations only changes the potential vector $h$ and
the diagonal of the information matrix $J$. Without loss of generality, in subsequent discussion we
assume that any observations have already been absorbed into $J$ and $h$.
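The estimation problem above can be sketched numerically as follows. The example builds the thin membrane information matrix on a hypothetical 5-node chain (a one-dimensional stand-in for an image), absorbs noisy observations into $\hat{J}$ and $\hat{h}$, and solves for the posterior mean. The values of `alpha`, `beta`, `sigma2` and `y`, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def thin_membrane_J(n, edges, alpha, beta):
    """Information matrix with J_ii = alpha + beta * |N(i)| and J_ij = -beta on edges."""
    J = alpha * np.eye(n)
    for i, j in edges:
        J[i, i] += beta
        J[j, j] += beta
        J[i, j] -= beta
        J[j, i] -= beta
    return J

n = 5
edges = [(k, k + 1) for k in range(n - 1)]   # hypothetical 1-D chain
alpha, beta, sigma2 = 0.1, 1.0, 0.5          # illustrative parameter values
J = thin_membrane_J(n, edges, alpha, beta)

y = np.array([0.0, 1.0, 0.5, 2.0, 1.5])      # illustrative observations y = x + v
J_post = J + np.eye(n) / sigma2              # only the diagonal of J changes
h_post = y / sigma2                          # the prior has zero potential vector
mean_post = np.linalg.solve(J_post, h_post)
print(mean_post)
```

The result agrees with the usual covariance-form estimate $P (P + \sigma^2 I)^{-1} y$ with $P = J^{-1}$, while the information form keeps the sparsity of the graph.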


2.2 Belief Propagation and Gaussian Elimination 


An important inference problem for a graphical model is computing the marginals $p_i(x_i)$, obtained
by integrating $p(x)$ over all variables except $x_i$, for each node $i$ (see footnote 6). This problem can be solved very
efficiently in graphs that are trees by a form of variable elimination, known as belief propagation, 
which also provides an approximate method for general graphs. 


Belief Propagation in Trees In principle, the marginal of a given node can be computed by re- 
cursively eliminating variables one by one until just the desired node remains. Belief propagation 
in trees can be interpreted as an efficient form of variable elimination. Rather than computing the 
marginal for each variable independently, we instead compute these together by sharing the results 
of intermediate computations. Ultimately each node j must receive information from each of its 
neighbors, where the message, $m_{i \to j}(x_j)$, from neighbor $i$ to $j$ represents the result of eliminating
all of the variables in the subtree rooted at node $i$ and including all of its neighbors other than $j$ (see
Figure 1). Since each of these messages is itself made up of variable elimination steps corresponding
to the subtrees rooted at the other neighbors of node $i$, there is a set of fixed-point equations that
relate messages throughout the tree:

$$m_{i \to j}(x_j) = \int \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}(i) \setminus j} m_{k \to i}(x_i)\, dx_i. \qquad (5)$$

Given these fixed-point messages, the marginals are obtained by combining messages at each node,

$$p_i(x_i) = \psi_i(x_i) \prod_{k \in \mathcal{N}(i)} m_{k \to i}(x_i),$$


and normalizing the result. 

The equations (5) can be solved in a finite number of steps using a variety of message sched- 
ules, including one schedule that corresponds roughly to sequential variable elimination and back- 
substitution (a first pass from leaf nodes toward a common, overall “root” node followed by a 
reverse pass back to the leaf nodes) and a fully parallel schedule in which each node begins by 
sending non-informative messages (all $m_{i \to j}$ initially set to 1), followed by iterative computation
of (5) throughout the tree. For trees, either message schedule will terminate with the correct val- 
ues after a finite number of steps (equal to the diameter of the tree in the case of the fully parallel 
iteration). 





6. Another important problem is computation of max-marginals $p_i(x_i)$, obtained by maximizing with respect to the
other variables, which is useful to determine the mode $\hat{x} = \arg\max p(x)$. In Gaussian models, these are equivalent
inference problems because marginals are proportional to max-marginals and the mean is equal to the mode.

As we have discussed, there are a variety of ways in which the information matrix in GMRFs
can be decomposed into edge and node potential functions, and each such decomposition leads to
BP iterations that are different in detail (see footnote 7). In our development we will use the simple decomposition
in (4), directly in terms of the elements of $J$.

For Gaussian models expressed in information form, variable elimination/marginalization
corresponds to Gaussian elimination (see footnote 8). For example, if we wish to eliminate a single variable $i$ to obtain
the marginal over $U = V \setminus i$, the formulas yielding the information parameterization for the marginal
on $U$ are:

$$\hat{J}_U = J_{U,U} - J_{U,i} J_{ii}^{-1} J_{i,U} \quad \text{and} \quad \hat{h}_U = h_U - J_{U,i} J_{ii}^{-1} h_i.$$

Here $\hat{J}_U$ and $\hat{h}_U$ specify the marginal density on $x_U$, whereas $J_{U,U}$ and $h_U$ are a submatrix and a
subvector of the information parameters on the full graph. The messages in Gaussian models can
be parameterized in information form

$$m_{i \to j}(x_j) = \exp\{-\tfrac{1}{2} \Delta J_{i \to j} x_j^2 + \Delta h_{i \to j} x_j\}, \qquad (6)$$

so that the fixed-point equations (5) can be stated in terms of these information parameters. We do
this in two steps. The first step corresponds to preparing the message to be sent from node $i$ to node
$j$ by collecting information from all of the other neighbors of $i$:

$$\hat{J}_{i \setminus j} = J_{ii} + \sum_{k \in \mathcal{N}(i) \setminus j} \Delta J_{k \to i} \quad \text{and} \quad \hat{h}_{i \setminus j} = h_i + \sum_{k \in \mathcal{N}(i) \setminus j} \Delta h_{k \to i}. \qquad (7)$$

The second step produces the information quantities to be propagated to node $j$:

$$\Delta J_{i \to j} = -J_{ji} \hat{J}_{i \setminus j}^{-1} J_{ij} \quad \text{and} \quad \Delta h_{i \to j} = -J_{ji} \hat{J}_{i \setminus j}^{-1} \hat{h}_{i \setminus j}. \qquad (8)$$

As before, these equations can be solved by various message schedules, ranging from leaf-root-leaf
Gaussian elimination and back-substitution to fully parallel iteration starting from the
non-informative messages in which all $\Delta J_{i \to j}$ and $\Delta h_{i \to j}$ are set to zero. When the fixed point solution
is obtained, the marginal at each node is computed by combining messages and
local information:

$$\hat{J}_i = J_{ii} + \sum_{k \in \mathcal{N}(i)} \Delta J_{k \to i} \quad \text{and} \quad \hat{h}_i = h_i + \sum_{k \in \mathcal{N}(i)} \Delta h_{k \to i}, \qquad (9)$$

which can be easily inverted to recover the marginal mean and variance:

$$\mu_i = \hat{J}_i^{-1} \hat{h}_i \quad \text{and} \quad P_{ii} = \hat{J}_i^{-1}.$$
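The two-step updates (7) and (8), followed by the combination step (9), can be sketched as below under the fully parallel schedule with zero initial messages. The function name `gaussian_bp` and the star-shaped 4-node tree are hypothetical, and the loops are written for readability rather than efficiency. On a tree the result should match direct inversion of $J$ once the number of iterations reaches the diameter.

```python
import numpy as np

def gaussian_bp(J, h, n_iters):
    """Parallel Gaussian BP in information form; returns (means, variances)."""
    n = J.shape[0]
    nbrs = [[j for j in range(n) if j != i and J[i, j] != 0.0] for i in range(n)]
    dJ = np.zeros((n, n))   # dJ[i, j] stores Delta J_{i->j}
    dh = np.zeros((n, n))   # dh[i, j] stores Delta h_{i->j}
    for _ in range(n_iters):
        new_dJ = np.zeros((n, n))
        new_dh = np.zeros((n, n))
        for i in range(n):
            for j in nbrs[i]:
                # Step (7): collect information from all neighbors of i except j.
                J_hat = J[i, i] + sum(dJ[k, i] for k in nbrs[i] if k != j)
                h_hat = h[i] + sum(dh[k, i] for k in nbrs[i] if k != j)
                # Step (8): eliminate x_i and pass the result to j.
                new_dJ[i, j] = -J[j, i] * J[i, j] / J_hat
                new_dh[i, j] = -J[j, i] * h_hat / J_hat
        dJ, dh = new_dJ, new_dh
    # Step (9): combine incoming messages with local information.
    J_marg = np.array([J[i, i] + sum(dJ[k, i] for k in nbrs[i]) for i in range(n)])
    h_marg = np.array([h[i] + sum(dh[k, i] for k in nbrs[i]) for i in range(n)])
    return h_marg / J_marg, 1.0 / J_marg

# Hypothetical tree: node 1 is joined to nodes 0, 2 and 3 (diameter 2).
J = np.array([[1.0, 0.3, 0.0, 0.0],
              [0.3, 1.0, -0.4, 0.2],
              [0.0, -0.4, 1.0, 0.0],
              [0.0, 0.2, 0.0, 1.0]])
h = np.array([1.0, -0.5, 0.2, 0.0])
mu_bp, var_bp = gaussian_bp(J, h, n_iters=5)
print(mu_bp, var_bp)
```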
In general, performing Gaussian elimination corresponds, up to a permutation, to computing an
$LDL^T$ factorization of the information matrix, that is, $P J P^T = L D L^T$ where $L$ is lower-triangular, $D$
is diagonal and $P$ is a permutation matrix corresponding to a particular choice of elimination order.
This factorization exists if $J$ is non-singular. In trees, the elimination order can be chosen such that
at each step of the procedure, the next node eliminated is a leaf node of the remaining subtree. Each
node elimination step then corresponds to a message in the “upward” pass of the leaf-root-leaf form
of Gaussian BP. In particular, $D_{ii} = \hat{J}_{i \setminus j}$ at all nodes $i$ except the last (here, $j$ is the parent of node
$i$ when $i$ is eliminated) and $D_{ii} = \hat{J}_i$ for that last variable corresponding to the root of the tree. It is
clear that $D_{ii} > 0$ for all $i$ if and only if $J$ is positive definite. We conclude that for models on trees,
$J$ being positive definite is equivalent to all of the quantities $\hat{J}_{i \setminus j}$ and $\hat{J}_i$ in (7),(9) being positive, a
condition we indicate by saying that BP on this tree is well-posed. Thus, performing Gaussian BP
on trees serves as a simple test for validity of the model. The importance of this notion will become
apparent shortly.

Figure 1: An illustration of BP message-passing on trees. A message $m_{i \to j}$ passed from node $i$ to
node $j \in \mathcal{N}(i)$ captures the effect of eliminating the subtree rooted at $i$.

7. One common decomposition for pairwise-normalizable models selects $A_i > 0$ and $B_{ij} \succ 0$ in (3) (Plarre and Kumar,
2004; Weiss and Freeman, 2001; Moallemi and Van Roy, 2006a).

8. The connection between Gaussian elimination and belief propagation has been noted before by Plarre and Kumar
(2004), although they do not use the information form.


Loopy Belief Propagation The message passing formulas derived for tree models can also be 
applied to models defined on graphs with cycles, even though this no longer corresponds precisely 
to variable elimination in the graph. This approximation method, called loopy belief propagation 
(LBP), was first proposed by Pearl (1988). Of course in this case, since there are cycles in the graph,
only iterative message-scheduling forms can be defined. To be precise, a message schedule $\{\mathcal{M}^{(n)}\}$
specifies which messages $m^{(n)}_{i \to j}$, corresponding to directed edges $(i, j) \in \mathcal{M}^{(n)}$ (see footnote 9), are updated at step
$n$. The messages in $\mathcal{M}^{(n)}$ are updated using

$$m^{(n)}_{i \to j}(x_j) = \int \psi_{ij}(x_i, x_j)\, \psi_i(x_i) \prod_{k \in \mathcal{N}(i) \setminus j} m^{(n-1)}_{k \to i}(x_i)\, dx_i \qquad (10)$$

and $m^{(n)}_{i \to j} = m^{(n-1)}_{i \to j}$ for the other messages. For example, in the fully parallel case all messages are
updated at each iteration whereas, in serial versions, only one message is updated at each iteration.
For GMRFs, application of (10), with messages $m^{(n)}_{i \to j}$ parameterized as in (6), reduces to iterative
application of equations (7),(8). We denote the information parameters at step $n$ by $\Delta J^{(n)}_{i \to j}$ and $\Delta h^{(n)}_{i \to j}$.
We initialize LBP with non-informative zero values for all of the information parameters in these 
messages. It is well known that LBP may or may not converge. If it does converge, it will not, in 
general, yield the correct values for the marginal distributions. In the Gaussian case, however, it is 
known (Weiss and Freeman, 2001; Rusmevichientong and Van Roy, 2001) that if LBP converges, 
it yields the correct mean values but, in general, incorrect values for the variances. While there 
has been considerable work on analyzing the convergence of LBP in general and for GMRFs in 
particular, the story has been far from complete. One major contribution of this paper is analysis 
that both provides new insights into LBP for Gaussian models and also brings that story several 
steps closer to completion. 
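The behavior just described can be observed on a small loopy model. The sketch below runs the parallel updates (7),(8) in vectorized form on a 4-cycle with a chord, with one negative edge weight and $r = 0.3$. The graph labeling, the placement of the negative edge, the potential vector and the function name `gaussian_lbp` are assumptions made for illustration.

```python
import numpy as np

def gaussian_lbp(J, h, n_iters=300):
    """Parallel Gaussian LBP with zero initial messages; returns (means, variances)."""
    d = np.diag(J)
    A = J - np.diag(d)            # off-diagonal part, zero where there is no edge
    dJ = np.zeros_like(J)         # dJ[i, j] stores Delta J_{i->j}
    dh = np.zeros_like(J)         # dh[i, j] stores Delta h_{i->j}
    for _ in range(n_iters):
        # Entry [i, j] holds the quantities of (7): all messages into i except the one from j.
        J_hat = (d + dJ.sum(axis=0))[:, None] - dJ.T
        h_hat = (h + dh.sum(axis=0))[:, None] - dh.T
        # Update (8); entries stay zero wherever A is zero.
        dJ = -A * A / J_hat
        dh = -A * h_hat / J_hat
    J_marg = d + dJ.sum(axis=0)
    h_marg = h + dh.sum(axis=0)
    return h_marg / J_marg, 1.0 / J_marg

# Hypothetical model: cycle 0-1-2-3-0 with chord {0, 2}; edge {0, 1} carries weight -r.
r = 0.3
R = np.zeros((4, 4))
for i, j, sign in [(0, 1, -1), (1, 2, 1), (2, 3, 1), (3, 0, 1), (0, 2, 1)]:
    R[i, j] = R[j, i] = sign * r
J = np.eye(4) - R
h = np.array([1.0, 0.0, -1.0, 2.0])

mu_lbp, var_lbp = gaussian_lbp(J, h)
mu_exact = np.linalg.solve(J, h)
var_exact = np.diag(np.linalg.inv(J))
print("means    ", mu_lbp, mu_exact)
print("variances", var_lbp, var_exact)
```

For this diagonally dominant choice of $r$ the means agree with the exact solution, while the printed variances can be compared to see the discrepancy noted above.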

A key component of our analysis is the insightful interpretation of LBP in terms of the so-called
computation tree (Yedidia et al., 2003; Weiss and Freeman, 2001; Tatikonda and Jordan, 2002),
which captures the structure of LBP computations. The basic idea here is that to each message $m^{(n)}_{i \to j}$
and marginal estimate $p^{(n)}_i$ there are associated computation trees $T^{(n)}_{i \to j}$ and $T^{(n)}_i$ that summarize
their pedigree. Initially, these trees are just single nodes. When message $m^{(n)}_{i \to j}$ is computed, its
computation tree $T^{(n)}_{i \to j}$ is constructed by joining the trees $T^{(n-1)}_{k \to i}$, for all neighbors $k$ of $i$ except
$j$, at their common root node $i$ and then adding an additional edge $(i, j)$ to form $T^{(n)}_{i \to j}$ rooted at
$j$. When marginal estimate $p^{(n)}_i$ is computed, its computation tree $T^{(n)}_i$ is formed by joining the
trees $T^{(n)}_{k \to i}$, for all neighbors $k$ of $i$, at their common root. Each node and edge of the original
graph may be replicated many times in the computation tree, but in a manner which preserves the
local neighborhood structure. Potential functions are assigned to the nodes and edges of $T^{(n)}_i$ by
copying these from the corresponding nodes and edges of the original loopy graphical model. In
this manner, we obtain a Markov tree model in which the marginal at the root node is precisely
$p^{(n)}_i$ as computed by LBP. In the case of the fully parallel form of LBP, this leads to a collection of
“balanced” computation trees $T^{(n)}_i$ (assuming there are no leaf nodes in $G$) having uniform depth $n$,
as illustrated in Figure 2. The same construction applies for other message schedules with the only
difference being that the resulting computation trees may grow in a non-uniform manner. Our
walk-sum analysis of LBP in Section 4, which relies on computation trees, applies for general message
passing schedules.

Figure 2: (a) Graph of a Gauss-Markov model with nodes {1,2,3,4} and with edge weights (partial
correlations) as shown. (b) The parallel LBP message passing scheme. In (c), we show
how, after 3 iterations, messages link up to form the computation tree $T^{(3)}_1$ of node 1 (the
subtree associated with one of the incoming messages is also indicated within the dotted outline).
In (d), we illustrate an equivalent Gauss-Markov tree model, with edge weights copied
from (a), which has the same marginal at the root node as computed by LBP after 3
iterations.

9. For each undirected edge $\{i, j\} \in E$ there are two messages: $m_{i \to j}$ for direction $(i, j)$, and $m_{j \to i}$ for $(j, i)$.

As we have mentioned, BP on trees, which corresponds to performing Gaussian elimination, is 
well-posed if and only if J is positive definite. LBP on Gaussian models corresponds to Gaussian 
elimination in the computation tree, which has its own information matrix corresponding to the 
unfolding illustrated in Figure 2 and involving replication of information parameters of the original 
loopy graphical model. Consequently, LBP is well-posed, yielding non-negative variances at each 
stage of the iteration, if and only if the model on the computation tree is valid, that is, if and only 
if the information matrix for the computation tree is positive definite. Very importantly, this is not 
always the case (even though the matrix J on the original graph is positive definite). The analysis 
in this paper, among other things, makes this point clear through analysis of the situations in which 
LBP converges and when it fails to converge. 


3. Walk-Summable Gaussian Models 


Now we describe our walk-sum framework for Gaussian inference. It is convenient to assume that 
we have normalized our model (by rescaling variables) so that $J_{ii} = 1$ for all $i$. Then, $J = I - R$ where
$R$ has zero diagonal and the off-diagonal elements are equal to the partial correlation coefficients $r_{ij}$
in (2). We label each edge $\{i, j\}$ of the graph $G$ with partial correlations $r_{ij}$ as edge weights (e.g.,
see Figures 3 and 5). 


3.1 Walk-Summability 


A walk of length $l \ge 0$ in a graph $G$ is a sequence $w = (w_0, w_1, \ldots, w_l)$ of nodes $w_k \in V$ such that
each step of the walk $(w_k, w_{k+1})$ corresponds to an edge of the graph $\{w_k, w_{k+1}\} \in E$. Walks may
visit nodes and cross edges multiple times. We let $l(w)$ denote the length of walk $w$. We define the
weight of a walk to be the product of edge weights along the walk:

$$\phi(w) = \prod_{k=1}^{l(w)} r_{w_{k-1}, w_k}.$$

We also allow zero-length “self” walks $w = (v)$ at each node $v$ for which we define $\phi(w) = 1$.
To make a connection between these walks and Gaussian inference, we decompose the covariance
matrix using the Neumann power series for the matrix inverse (see footnote 10):

$$P = J^{-1} = (I - R)^{-1} = \sum_{k=0}^{\infty} R^k, \quad \text{for } \rho(R) < 1.$$

Here $\rho(R)$ is the spectral radius of $R$, the maximum absolute value of eigenvalues of $R$. The power
series converges if $\rho(R) < 1$ (see footnote 11). The $(i, j)$-th element of $R^l$ can be expressed as a sum of weights of
walks $w$ that go from $i$ to $j$ and have length $l$ (denoted $w : i \xrightarrow{l} j$):

$$(R^l)_{ij} = \sum_{w_1, \ldots, w_{l-1}} r_{i, w_1} r_{w_1, w_2} \cdots r_{w_{l-1}, j} = \sum_{w : i \xrightarrow{l} j} \phi(w).$$





10. The Neumann series holds for the unnormalized case as well: $J = D - K$, where $D$ is the diagonal part of $J$. With the
weight of a walk defined as $\phi(w) = \prod_{k=1}^{l(w)} K_{w_{k-1}, w_k} \big/ \prod_{k=0}^{l(w)} D_{w_k, w_k}$, all our analysis extends to the unnormalized case.
11. Note that $\rho(R)$ can be greater than 1 while $I - R \succ 0$. This occurs if $R$ has an eigenvalue less than $-1$. Such models
are not walk-summable, so the analysis in Section 5 (rather than Section 4.2) applies.

The last equality holds because only the terms that correspond to walks in the graph have non-zero
contributions: for all other terms at least one of the partial correlation coefficients $r_{w_k, w_{k+1}}$ is zero.
The set of walks from $i$ to $j$ of length $l$ is finite, and the sum of weights of these walks (the
walk-sum) is well-defined. We would like to also define walk-sums over arbitrary countable sets of walks.
However, care must be taken, as walk-sums over countably many walks may or may not converge,
and convergence may depend on the order of summation. This motivates the following definition:

We say that a Gaussian distribution is walk-summable (WS) if for all $i, j \in V$ the unordered sum
over all walks $w$ from $i$ to $j$ (denoted $w : i \to j$)

$$\sum_{w : i \to j} \phi(w)$$

is well-defined (i.e., converges to the same value for every possible summation order). Appealing to
basic results of analysis (Rudin, 1976; Godement, 2004), the unordered sum is well-defined if and
only if it converges absolutely, that is, if $\sum_{w : i \to j} |\phi(w)|$ converges.
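The identity between $(R^l)_{ij}$ and the sum of weights of length-$l$ walks, and the Neumann series for $P$, can be checked by brute force on a small model. The sketch below enumerates all node sequences of a given length. The 4-node graph (a 4-cycle with a chord, one negative edge, $r = 0.3$) and the function name `walk_sum` are assumptions made for illustration.

```python
import itertools
import numpy as np

def walk_sum(R, i, j, length):
    """Sum of walk weights over all walks of the given length from i to j."""
    n = R.shape[0]
    if length == 0:
        return 1.0 if i == j else 0.0   # zero-length self walk has weight 1
    total = 0.0
    for middle in itertools.product(range(n), repeat=length - 1):
        nodes = (i, *middle, j)
        weight = 1.0
        for a, b in zip(nodes[:-1], nodes[1:]):
            weight *= R[a, b]           # zero whenever {a, b} is not an edge
        total += weight
    return total

# Hypothetical model: cycle 0-1-2-3-0 with chord {0, 2}; edge {0, 1} carries weight -r.
r = 0.3
R = np.zeros((4, 4))
for i, j, sign in [(0, 1, -1), (1, 2, 1), (2, 3, 1), (3, 0, 1), (0, 2, 1)]:
    R[i, j] = R[j, i] = sign * r

# Walk-sums of length 3 from node 0 to node 2, compared with the matrix power.
print(walk_sum(R, 0, 2, 3), np.linalg.matrix_power(R, 3)[0, 2])

# Partial sums of the Neumann series approach P = (I - R)^{-1}.
neumann = sum(np.linalg.matrix_power(R, k) for k in range(200))
P = np.linalg.inv(np.eye(4) - R)
print(np.max(np.abs(neumann - P)))
```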

Before we take a closer look at walk-summability, we introduce additional notation. For a matrix
$A$, let $\bar{A}$ be the element-wise absolute value of $A$, that is, $\bar{A}_{ij} = |A_{ij}|$. We use the notation $A \ge B$ for
element-wise comparisons, and $A \succeq B$ for comparisons in positive definite ordering. The following
version of the Perron-Frobenius theorem (Horn and Johnson, 1985; Varga, 2000) for non-negative
matrices (here $\bar{R} \ge 0$) is used on several occasions in the paper:


Perron-Frobenius theorem There exists a non-negative eigenvector $x \ge 0$ of $\bar{R}$ with eigenvalue
$\rho(\bar{R})$. If the graph $G$ is connected (where $r_{ij} \ne 0$ for all edges of $G$) then $\rho(\bar{R})$ and $x$ are strictly
positive and, apart from $\gamma x$ with $\gamma > 0$, there are no other non-negative eigenvectors of $\bar{R}$.

In addition, we often use the following monotonicity properties of the spectral radius:

$$\text{(i) } \rho(R) \le \rho(\bar{R}) \qquad \text{(ii) If } \bar{R}_1 \le \bar{R}_2 \text{ then } \rho(\bar{R}_1) \le \rho(\bar{R}_2). \qquad (11)$$

We now present several equivalent conditions for walk-summability:


Proposition 1 (Walk-Summability) Each of the following conditions is equivalent to walk-summability:
(i) $\sum_{w : i \to j} |\phi(w)|$ converges for all $i, j \in V$.
(ii) $\sum_l \bar{R}^l$ converges.
(iii) $\rho(\bar{R}) < 1$.
(iv) $I - \bar{R} \succ 0$.
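Conditions (iii) and (iv) are easy to test numerically. The following sketch normalizes $J$ to unit diagonal, forms $\bar{R}$, and checks both conditions on two hypothetical three-node models: a frustrated triangle with all off-diagonal entries of $J$ equal to 0.6, and an attractive triangle with all off-diagonal entries equal to $-0.4$. The function names and both models are assumptions made for illustration.

```python
import numpy as np

def abs_partial_correlations(J):
    """Element-wise absolute value of R = I - J after rescaling J to unit diagonal."""
    d = np.sqrt(np.diag(J))
    R_bar = np.abs(np.eye(J.shape[0]) - J / np.outer(d, d))
    np.fill_diagonal(R_bar, 0.0)
    return R_bar

def spectral_radius(M):
    """Largest absolute eigenvalue of a symmetric matrix."""
    return np.max(np.abs(np.linalg.eigvalsh(M)))

def is_pos_def(M):
    return np.min(np.linalg.eigvalsh(M)) > 0.0

def is_walk_summable(J):
    """Condition (iii): rho(R_bar) < 1."""
    return spectral_radius(abs_partial_correlations(J)) < 1.0

ones = np.ones((3, 3)) - np.eye(3)
J_frustrated = np.eye(3) + 0.6 * ones   # valid, but rho(R_bar) = 1.2
J_attractive = np.eye(3) - 0.4 * ones   # valid, rho(R_bar) = 0.8

for name, J in [("frustrated", J_frustrated), ("attractive", J_attractive)]:
    R_bar = abs_partial_correlations(J)
    print(name, is_pos_def(J), is_walk_summable(J), is_pos_def(np.eye(3) - R_bar))
```

In both cases conditions (iii) and (iv) give the same answer, and the first model shows that validity of $J$ alone does not imply walk-summability.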


Figure 3: Example graphs: (a) 4-cycle with a chord. (b) 5-cycle.

The proof appears in Appendix A. It uses absolute convergence to rearrange walks in order of
increasing length, and the Perron-Frobenius theorem for part (iv). The condition $\rho(\bar{R}) < 1$ is stronger
than $\rho(R) < 1$. The latter is sufficient for the convergence of the walks ordered by increasing length,
whereas walk-summability enables convergence to the same answer in arbitrary order of summation.
Note that (iv) implies that the model is walk-summable if and only if we can replace all negative
partial correlation coefficients by their absolute values and still have a well-defined model (i.e., with
information matrix $I - \bar{R} \succ 0$). We also note that condition (iv) relates walk-summability to the
so-called H-matrices in linear algebra (Horn and Johnson, 1991; Varga, 2000) (see footnote 12). As an immediate
corollary, we identify the following important subclass of walk-summable models:


Corollary 2 (Attractive Models) Let $J = I - R$ be a valid model ($J \succ 0$) with non-negative partial
correlations $R \ge 0$. Then, $J = I - R$ is walk-summable.


A superclass of attractive models is the set of non-frustrated models. A model is non-frustrated if 
it does not contain any frustrated cycles, that is, cycles with an odd number of negative edge weights. 
We show in Appendix A (in the proof of Corollary 3) that if the model is non-frustrated, then one 
can negate some of the variables to make the model attractive (see footnote 13). Hence, we have another subclass
of walk-summable models (the inclusion is strict as some frustrated models are walk-summable, see 
Example 1): 


Corollary 3 (Non-frustrated models) Let $J = I - R$ be valid. If $R$ is non-frustrated then $J$ is
walk-summable.


Example 1. In Figure 3 we illustrate two small Gaussian graphical models, which we use
throughout the paper. In both models the information matrix $J$ is normalized to have unit diagonal
and to have partial correlations as indicated in the figure. Consider the 4-cycle with a chord
in Figure 3(a). The model is frustrated (due to the opposing sign of one of the partial correlations),
and increasing $r$ worsens the frustration. For $0 < r < 0.39039$, the model is valid and
walk-summable: for example, for $r = 0.39$, $\lambda_{\min}(J) = 0.22 > 0$, and $\rho(\bar{R}) \approx 0.9990 < 1$. In the
interval $0.39039 < r < 0.5$ the model is valid, but not walk-summable: for example, for $r = 0.4$,
$\lambda_{\min}(J) = 0.2 > 0$, and $\rho(\bar{R}) \approx 1.0246 > 1$. Also, note that for $R$ (as opposed to $\bar{R}$), $\rho(R) < 1$ for
$r < 0.5$ and $\rho(R) > 1$ for $r > 0.5$. Finally, the model stops being diagonally dominant above $r = 1/3$,
but walk-summability is a strictly larger set and extends until $r \approx 0.39039$. We summarize various
critical points for this model and for the model in Figure 3(b) in the diagram in Figure 4.
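
The quantities quoted in Example 1 can be checked numerically. The following is a minimal sketch, not the authors' code: since the figure is not reproduced here, it assumes the 4-cycle is 1-2-3-4-1 with chord 1-3 and that the single negative edge is (1,2), a placement that reproduces the values stated above. The function names are illustrative.

```python
import numpy as np

def four_cycle_with_chord(r):
    """Partial correlation matrix R for the 4-cycle 1-2-3-4-1 with chord 1-3.

    Assumption: edge (1,2) carries -r and every other edge carries +r
    (nodes are 0-indexed in the code).
    """
    R = np.zeros((4, 4))
    edges = {(0, 1): -r, (1, 2): r, (2, 3): r, (3, 0): r, (0, 2): r}
    for (i, j), w in edges.items():
        R[i, j] = R[j, i] = w
    return R

def spectral_radius(M):
    """Largest absolute eigenvalue of a symmetric matrix."""
    return float(np.max(np.abs(np.linalg.eigvalsh(M))))

def summarize(r):
    """Validity, walk-summability and diagonal-dominance indicators for a given r."""
    R = four_cycle_with_chord(r)
    J = np.eye(4) - R
    return {
        "lambda_min_J": float(np.min(np.linalg.eigvalsh(J))),   # > 0 means valid
        "rho_R": spectral_radius(R),
        "rho_Rbar": spectral_radius(np.abs(R)),                 # < 1 means walk-summable
        "diag_dominant": bool(np.all(np.sum(np.abs(R), axis=1) < 1.0)),
    }

for r in (0.30, 0.39, 0.40):
    print(r, summarize(r))
```

Under this sign assumption the eigenvalues of $R$ are $\pm r$ and $\pm 2r$, while $\rho(\bar{R}) = r(1 + \sqrt{17})/2$, which gives the critical value $r \approx 0.39039$.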


Here are additional useful implications of walk-summability, with proof in Appendix A: 


Proposition 4 (WS Necessary Conditions) All of the following are implied by walk-summability: 





12. A (possibly non-symmetric) matrix $A$ is an H-matrix if all eigenvalues of the matrix $M(A)$, where $M_{ii} = |A_{ii}|$, and
$M_{ij} = -|A_{ij}|$ for $i \ne j$, have positive real parts. For symmetric matrices this is equivalent to $M$ being positive definite.
In (iv) $J$ is an H-matrix since $M(J) = I - \bar{R} \succ 0$.

13. This result is referred to in Kirkland et al. (1996). However, in addition to proving that there exists such a sign 
similarity, our proof also gives an algorithm which checks whether or not the model is frustrated, and determines 
which subset of variables to negate if the model is non-frustrated. 
[Figure 4 plot: critical values of r. (a) Diag. dominant: 0.33333; walk-summable: 0.39039; $\rho(R) < 1$: 0.5; valid: 0.5. (b) Diag. dominant: 0.5; walk-summable: 0.5; $\rho(R) < 1$: 0.5; valid: 0.61803.]
Figure 4: Critical regions for example models from Figure 3. (a) 4-cycle with a chord. (b) 5-cycle.


(i) $\rho(R) < 1$.
(ii) $J = I - R \succ 0$.
(iii) $\sum_l R^l = (I - R)^{-1}$.


Implication (ii) shows that walk-summability is a sufficient condition for validity of the model.
Also, (iii) shows the relevance of walk-sums for inference since $P = J^{-1} = (I - R)^{-1} = \sum_l R^l$ and
$\mu = J^{-1} h = \sum_l R^l h$.


3.2 Walk-Sums for Inference 


Next we show that, in walk-summable models, means and variances correspond to walk-sums over 
certain sets of walks. 


Proposition 5 (WS Inference) If $J = I - R$ is walk-summable, then the covariance $P = J^{-1}$ is given
by the walk-sums:

$$P_{ij} = \sum_{w: i \to j} \phi(w).$$

Also, the means are walk-sums reweighted by the value of $h$ at the start of each walk:

$$\mu_i = \sum_{w: * \to i} h_* \phi(w),$$

where the sum is over all walks which end at node $i$ (with arbitrary starting node), and where $*$
denotes the starting node of the walk $w$.


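Proof. We use the fact that $(R^l)_{ij} = \sum_{w: i \xrightarrow{l} j} \phi(w)$, a sum over the walks of length $l$ from $i$ to $j$. Then,

$$P_{ij} = \sum_l (R^l)_{ij} = \sum_l \sum_{w: i \xrightarrow{l} j} \phi(w) = \sum_{w: i \to j} \phi(w)$$

and

$$\mu_i = \sum_j h_j P_{ji} = \sum_j \sum_{w: j \to i} h_j \phi(w) = \sum_{w: * \to i} h_* \phi(w). \qquad \square$$

[Figure 5 content for the 3-cycle on nodes 1, 2, 3:
Single walk: $w = (1,2,3)$. Weight: $\phi(w) = r_{12} r_{23}$, $\phi_h(w) = h_1 r_{12} r_{23}$.
Self-return walks, $W(1 \to 1)$: $\{(1), (1,2,1), (1,3,1), (1,2,3,1), (1,3,2,1), (1,2,1,2,1), \ldots\}$
$P_{11} = \phi(1 \to 1) = 1 + r_{12} r_{21} + r_{13} r_{31} + r_{12} r_{23} r_{31} + \ldots$
Set of walks $W(* \to 1)$: $\{(1), (2,1), (3,1), (2,3,1), (3,2,1), (1,2,1), (1,3,1), \ldots\}$
$\mu_1 = \phi_h(* \to 1) = h_1 + h_2 r_{21} + h_3 r_{31} + h_2 r_{23} r_{31} + \ldots$]

Figure 5: Illustration of walk-sums for means and variances.
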
Walk-Sum Notation We now provide a more compact notation for walk-sets and walk-sums. In
general, given a set of walks $W$ we define the walk-sum:

$$\phi(W) = \sum_{w \in W} \phi(w)$$

and the reweighted walk-sum:

$$\phi_h(W) = \sum_{w \in W} h_{w_0} \phi(w)$$

where $w_0$ denotes the initial node in the walk $w$. Also, we adopt the convention that $W(\ldots)$ denotes
the set of all walks having some property $\ldots$ and denote the associated walk-sums simply as $\phi(\ldots)$
or $\phi_h(\ldots)$. For instance, $W(i \to j)$ denotes the set of all walks from $i$ to $j$ and $\phi(i \to j)$ is the
corresponding walk-sum. Also, $W(* \to i)$ denotes the set of all walks that end at node $i$ and $\phi_h(* \to i)$
is the corresponding reweighted walk-sum. In this notation, $P_{ij} = \phi(i \to j)$ and $\mu_i = \phi_h(* \to i)$. An
illustration of walk-sums and their connection to inference appears in Figure 5 where we list some
walks and walk-sums for a 3-cycle graph.
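
As an illustration of the walk-sum representation, the sketch below enumerates walks on a 3-cycle by brute force and compares the truncated walk-sums with partial sums of matrix powers and with the exact inverse. The edge weights and the vector `h` are hypothetical values chosen so that $\rho(\bar{R}) < 1$; they are not taken from the paper.

```python
import itertools
import numpy as np

# Hypothetical partial correlations on a 3-cycle (nodes 0, 1, 2) and potential vector h.
R = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.4],
              [-0.2, 0.4, 0.0]])
h = np.array([1.0, -2.0, 0.5])

def walk_weight(R, walk):
    """phi(w): product of edge weights along the walk (1 for a length-0 walk)."""
    return float(np.prod([R[a, b] for a, b in zip(walk[:-1], walk[1:])]))

def walk_sum(R, i, j, max_len):
    """Sum of phi(w) over all walks from i to j with length <= max_len.

    Every node sequence is enumerated; sequences that use a non-edge get
    weight zero automatically because the corresponding entry of R is zero.
    """
    n = R.shape[0]
    total = 1.0 if i == j else 0.0                  # the length-0 walk (i)
    for length in range(1, max_len + 1):
        for middle in itertools.product(range(n), repeat=length - 1):
            total += walk_weight(R, (i, *middle, j))
    return total

L = 10
brute = np.array([[walk_sum(R, i, j, L) for j in range(3)] for i in range(3)])
P_partial = sum(np.linalg.matrix_power(R, l) for l in range(L + 1))
P_exact = np.linalg.inv(np.eye(3) - R)

print(np.max(np.abs(brute - P_partial)))    # enumeration agrees with matrix powers
print(np.max(np.abs(P_partial - P_exact)))  # truncation error of the walk-sum
```

The brute-force enumeration grows exponentially with the walk length and is only meant to make the definition concrete; the matrix-power form is the practical way to evaluate a length-truncated walk-sum.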


Walk-Sum Algebra We now show that the walk-sums required for inference in walk-summable 
models can be significantly simplified by exploiting the recursive structure of walks. To do so, we 
make use of some simple algebraic properties of walk-sums. The following lemmas all assume that 
the model is walk-summable. 


Lemma 6 Let $W = \cup_{k=1}^{\infty} W_k$ where the subsets $W_k$ are disjoint. Then, $\phi(W) = \sum_{k=1}^{\infty} \phi(W_k)$.


Proof. By the sum-partition theorem for absolutely convergent series (Godement, 2004):
$\sum_{w \in W} \phi(w) = \sum_k \sum_{w \in W_k} \phi(w)$. $\square$


Lemma 7 Let $W = \cup_{k=1}^{\infty} W_k$ where $W_k \subset W_{k+1}$ for all $k$. Then, $\phi(W) = \lim_{k \to \infty} \phi(W_k)$.


Proof. Let $W_0$ be the empty set. Then, $W = \cup_{k=1}^{\infty} (W_k \setminus W_{k-1})$. By Lemma 6,

$$\phi(W) = \sum_{k=1}^{\infty} \phi(W_k \setminus W_{k-1}) = \lim_{N \to \infty} \sum_{k=1}^{N} \bigl(\phi(W_k) - \phi(W_{k-1})\bigr) = \lim_{N \to \infty} \bigl(\phi(W_N) - \phi(W_0)\bigr),$$

where we use $\phi(W_0) = 0$ in the last step to obtain the result. $\square$
Given two walks $u = (u_0, \ldots, u_n)$ and $v = (v_0, \ldots, v_m)$ with $u_n = v_0$ (walk $v$ begins where walk
$u$ ends) we define the product of walks $uv = (u_0, \ldots, u_n, v_1, \ldots, v_m)$. Let $U$ and $V$ be two countable
sets of walks such that every walk in $U$ ends at a given node $i$ and every walk in $V$ begins at this
node. Then we define the product set $UV = \{uv \mid u \in U, v \in V\}$. We say that $(U, V)$ is a valid
decomposition if for every $w \in UV$ there is a unique pair $(u, v) \in U \times V$ such that $uv = w$.


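Lemma 8 Let $(U, V)$ be a valid decomposition. Then, $\phi(UV) = \phi(U)\phi(V)$.


Proof. For individual walks it is evident that $\phi(uv) = \phi(u)\phi(v)$. Note that $UV = \cup_{u \in U} uV$,
where the sets $uV = \{uv \mid v \in V\}$ are mutually disjoint. By Lemma 6,

$$\phi(UV) = \sum_{u \in U} \phi(uV) = \sum_{u \in U} \sum_{v \in V} \phi(uv) = \sum_{u \in U} \sum_{v \in V} \phi(u)\phi(v) = \Bigl(\sum_{u \in U} \phi(u)\Bigr)\Bigl(\sum_{v \in V} \phi(v)\Bigr),$$

where we have used $\phi(uV) = \sum_{v \in V} \phi(uv)$ because $uV$ is one-to-one with $V$. $\square$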


Note that $W(i \to i)$ is the set of self-return walks at node $i$, that is, walks which begin and end
at node $i$. These self-return walks include walks which return to $i$ many times. Let $W(i \xrightarrow{\setminus i} i)$ be
the set of all walks with non-zero length that begin and end at $i$ but do not visit $i$ at any other point
in between. We call these the single-revisit self-return walks at node $i$. The set of self-return walks
that return exactly $k$ times is generated by taking the product of $k$ copies of $W(i \xrightarrow{\setminus i} i)$, denoted by
$W^k(i \xrightarrow{\setminus i} i)$. Thus, we obtain all self-return walks as

$$W(i \to i) = \cup_{k \ge 0} W^k(i \xrightarrow{\setminus i} i) \tag{12}$$

where $W^0(i \xrightarrow{\setminus i} i) \triangleq \{(i)\}$.
Similarly, recall that $W(* \to i)$ denotes the set of all walks which end at node $i$. Let $W(* \xrightarrow{\setminus i} i)$
denote the set of walks with non-zero length which end at node $i$ and do not visit $i$ previously (we
call them single-visit walks). Thus, all walks which end at $i$ are obtained as:

$$W(* \to i) = \bigl(\{(i)\} \cup W(* \xrightarrow{\setminus i} i)\bigr)\, W(i \to i), \tag{13}$$

which is a valid decomposition.
Now we can decompose means and variances in terms of single-visit and single-revisit walk- 
sums, which we will use in section 4.1 to analyze BP. 


Proposition 9 Let $\alpha_i = \phi(i \xrightarrow{\setminus i} i)$ and $\beta_i = \phi_h(* \xrightarrow{\setminus i} i)$. Then,

$$P_{ii} = \frac{1}{1 - \alpha_i} \quad \text{and} \quad \mu_i = \frac{h_i + \beta_i}{1 - \alpha_i}.$$

Figure 6: (a) A frustrated model defined on $G$ with one negative edge ($r > 0$). (b) The corresponding
attractive model defined on $\hat{G}$.


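Proof. First note that the decomposition of $W^k(i \xrightarrow{\setminus i} i)$ into products of $k$ single-revisit
self-return walks is a valid decomposition. By Lemma 8, $\phi(W^k(i \xrightarrow{\setminus i} i)) = \phi^k(i \xrightarrow{\setminus i} i) = \alpha_i^k$. Then, by (12)
and Lemma 6:

$$P_{ii} = \phi(i \to i) = \sum_k \alpha_i^k = \frac{1}{1 - \alpha_i}.$$

Walk-summability of the model implies convergence of the geometric series (i.e., $|\alpha_i| < 1$). Lastly,
the decomposition in (13) implies

$$\mu_i = \phi_h(* \to i) = \bigl(h_i + \phi_h(* \xrightarrow{\setminus i} i)\bigr)\, \phi(i \to i) = \frac{h_i + \beta_i}{1 - \alpha_i}. \qquad \square$$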
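
Proposition 9 can be checked numerically on a small model. The sketch below is an illustration under stated assumptions, not the paper's procedure: it evaluates $\alpha_i$ and $\beta_i$ by observing that walks which avoid node $i$ between their endpoints are exactly the walks of the subgraph with $i$ removed, so their sums are entries of $(I - R_{\setminus i})^{-1}$, where $R_{\setminus i}$ (a name introduced here) is $R$ with row and column $i$ deleted. The example matrix is the 4-cycle with a chord at $r = 0.3$ with an assumed sign placement, and `h` is hypothetical.

```python
import numpy as np

def single_revisit_sums(R, h, i):
    """Return (alpha_i, beta_i) for a walk-summable model J = I - R.

    alpha_i sums walks i -> ... -> i that avoid i in between;
    beta_i sums h-weighted walks that start elsewhere and first reach i at the end.
    """
    n = R.shape[0]
    rest = [k for k in range(n) if k != i]
    R_sub = R[np.ix_(rest, rest)]
    P_sub = np.linalg.inv(np.eye(n - 1) - R_sub)   # walk-sums inside the graph without i
    r_i = R[i, rest]                               # weights of the edges incident to i
    alpha = float(r_i @ P_sub @ r_i)               # leave i, wander away from i, step back
    beta = float(h[rest] @ P_sub @ r_i)            # start at s != i, wander, step into i
    return alpha, beta

r = 0.3
R = r * np.array([[0, -1, 1, 1],
                  [-1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
h = np.array([1.0, -0.5, 2.0, 0.3])
P = np.linalg.inv(np.eye(4) - R)
mu = P @ h

for i in range(4):
    alpha, beta = single_revisit_sums(R, h, i)
    print(i, 1.0 / (1.0 - alpha), P[i, i], (h[i] + beta) / (1.0 - alpha), mu[i])
```

In matrix terms, $1 - \alpha_i$ is the Schur complement of the block of $J$ that excludes node $i$, which is why $1/(1 - \alpha_i)$ recovers $P_{ii}$.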





3.3 Correspondence to Attractive Models 


We have already shown that attractive models are walk-summable. Interestingly, it turns out that 
inference in any walk-summable model can be reduced to inference in a corresponding attractive 
model defined on a graph with twice as many nodes. The basic idea here is to separate out the walks 
with positive and negative weights. 

Specifically, let $\hat{G} = (\hat{V}, \hat{E})$ be defined as follows. For each node $i \in V$ we define two corresponding
nodes $i_+ \in V_+$ and $i_- \in V_-$, and set $\hat{V} = V_+ \cup V_-$. For each edge $\{i, j\} \in E$ with $r_{ij} > 0$ we
define two edges $\{i_+, j_+\}, \{i_-, j_-\} \in \hat{E}$, and set the partial correlations on these edges to be equal
to $r_{ij}$. For each edge $\{i, j\} \in E$ with $r_{ij} < 0$ we define two edges $\{i_+, j_-\}, \{i_-, j_+\} \in \hat{E}$, and set the
partial correlations to be $-r_{ij}$. See Figure 6 for an illustration.

Let $(R_+)_{ij} = \max\{R_{ij}, 0\}$ and $(R_-)_{ij} = \max\{-R_{ij}, 0\}$. Then $R$ can be expressed as the difference
of these non-negative matrices: $R = R_+ - R_-$. Based on our construction, we have that

$$\hat{R} = \begin{pmatrix} R_+ & R_- \\ R_- & R_+ \end{pmatrix}$$

and $\hat{J} = I - \hat{R}$. This defines a unit-diagonal information matrix $\hat{J}$ on $\hat{G}$. Note that if $\hat{J} \succ 0$ then this
defines a valid attractive model.


Proposition 10 $\hat{J} = I - \hat{R} \succ 0$ if and only if $J = I - R$ is walk-summable.


The proof relies on the Perron-Frobenius theorem and is given in Appendix A. Now, let
$h = h_+ - h_-$ with $(h_+)_i = \max\{h_i, 0\}$ and $(h_-)_i = \max\{-h_i, 0\}$. Define $\hat{h} = \begin{pmatrix} h_+ \\ h_- \end{pmatrix}$. Now we have the
information form model $(\hat{h}, \hat{J})$ which is a valid, attractive model and also has non-negative node
potentials. Performing inference with respect to this augmented model, we obtain the mean vector
$\hat{\mu} = \begin{pmatrix} \hat{\mu}_+ \\ \hat{\mu}_- \end{pmatrix} \triangleq \hat{J}^{-1} \hat{h}$ and covariance matrix $\hat{P} = \begin{pmatrix} \hat{P}_{++} & \hat{P}_{+-} \\ \hat{P}_{-+} & \hat{P}_{--} \end{pmatrix} \triangleq \hat{J}^{-1}$. From these calculations, we can
obtain the moments $(\mu, P)$ of the original walk-summable model $(h, J)$:


Proposition 11 $P = \hat{P}_{++} - \hat{P}_{+-}$ and $\mu = \hat{\mu}_+ - \hat{\mu}_-$.


The proof appears in Appendix A. This proposition shows that estimation of walk-summable 
models may be reduced to inference in an attractive model in which all walk-sums are sums of posi- 
tive weights. In essence, this is accomplished by summing walks with positive and negative weights 
separately and then taking the difference, which is only possible for walk-summable models. 
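
The lifted construction is easy to try out numerically. The following sketch builds $\hat{R}$ and $\hat{h}$ with NumPy and recovers $(\mu, P)$ as in Proposition 11. The function names are illustrative, and the test matrix is the 4-cycle with a chord under an assumed sign placement (the negative edge on one cycle edge); neither is taken from the paper.

```python
import numpy as np

def lift_to_attractive(R, h):
    """Build R_hat = [[R+, R-], [R-, R+]] and h_hat = [h+; h-] on the doubled graph."""
    R_pos, R_neg = np.maximum(R, 0.0), np.maximum(-R, 0.0)
    R_hat = np.block([[R_pos, R_neg], [R_neg, R_pos]])
    h_hat = np.concatenate([np.maximum(h, 0.0), np.maximum(-h, 0.0)])
    return R_hat, h_hat

def moments_via_lift(R, h):
    """Recover (P, mu) of the original model from the lifted attractive model."""
    n = R.shape[0]
    R_hat, h_hat = lift_to_attractive(R, h)
    P_hat = np.linalg.inv(np.eye(2 * n) - R_hat)
    mu_hat = P_hat @ h_hat
    P = P_hat[:n, :n] - P_hat[:n, n:]        # P_hat_{++} - P_hat_{+-}
    mu = mu_hat[:n] - mu_hat[n:]             # mu_hat_+ - mu_hat_-
    return P, mu

def lifted_is_valid(R):
    """True when J_hat = I - R_hat is positive definite."""
    R_hat, _ = lift_to_attractive(R, np.zeros(R.shape[0]))
    return bool(np.min(np.linalg.eigvalsh(np.eye(R_hat.shape[0]) - R_hat)) > 0)

A = np.array([[0, -1, 1, 1],
              [-1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
h = np.array([1.0, -0.5, 2.0, 0.3])

P, mu = moments_via_lift(0.3 * A, h)
print(np.allclose(P, np.linalg.inv(np.eye(4) - 0.3 * A)))
print(lifted_is_valid(0.39 * A), lifted_is_valid(0.40 * A))
```

The eigenvalues of $\hat{R}$ are those of $R_+ + R_- = \bar{R}$ together with those of $R_+ - R_- = R$, which is one way to see why validity of the lifted model coincides with $\rho(\bar{R}) < 1$.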


3.4 Pairwise-Normalizability 


To simplify presentation we assume that the graph does not contain any isolated nodes (a node
without any incident edges). Then, we say that the information matrix $J$ is pairwise-normalizable
(PN) if we can represent $J$ in the form

$$J = \sum_{e \in E} [J_e]$$

where each $J_e$ is a $2 \times 2$ symmetric, positive definite matrix.^14 The notation $[J_e]$ means that $J_e$ is
zero-padded to a $|V| \times |V|$ matrix with its principal submatrix for $\{i, j\}$ being $J_e$ (with $e = \{i, j\}$).
Thus, $x^T [J_e] x = x_e^T J_e x_e$. Pairwise-normalizability implies that $J \succ 0$ because each node is covered
by at least one positive definite submatrix $J_e$. Let $\mathcal{J}_{PN}$ denote the set of $n \times n$ pairwise-normalizable
information matrices $J$ (not requiring unit-diagonal normalization). This set has nice convexity
properties. Recall that a set $X$ is convex if $x, y \in X$ implies $\lambda x + (1 - \lambda) y \in X$ for all $0 \le \lambda \le 1$ and
is a cone if $x \in X$ implies $\alpha x \in X$ for all $\alpha > 0$. A cone $X$ is pointed if $X \cap -X = \{0\}$.


Proposition 12 (Convexity of PN models) The set $\mathcal{J}_{PN}$ is a convex pointed cone.


The proof is in Appendix A. We now establish the following fundamental result:

Proposition 13 (WS $\Leftrightarrow$ PN) $J = I - R$ is walk-summable if and only if it is pairwise-normalizable.


Our proof appears in Appendix A. An equivalent result has been derived independently in
the linear algebra literature: Boman et al. (2005) establish that symmetric H-matrices with positive
diagonals (which is equivalent to WS by part (iv) of Proposition 1) are equivalent to matrices with
factor width at most two (PN models). However, the result PN $\Rightarrow$ WS was established earlier by
Johnson (2001). Our proof for WS $\Rightarrow$ PN uses the Perron-Frobenius theorem, whereas Boman et al.
(2005) use the generalized diagonal dominance property of H-matrices.

Equivalence to pairwise-normalizability gives much insight into the set of walk-summable models.
For example, the set of unit-diagonal $J$ matrices that are walk-summable is convex, as it is the
intersection of $\mathcal{J}_{PN}$ with an affine space. Also, the set of walk-summable $J$ matrices that are sparse
with respect to a particular graph $G$ (with some entries of $J$ restricted to 0) is convex.

Another important class of models are those that have a diagonally dominant information matrix,
that is, where for each $i$ it holds that $\sum_{j \ne i} |J_{ij}| < J_{ii}$.





14. An alternative definition of pairwise-normalizability is the existence of a decomposition $J = cI + \sum_{e \in E} [J_e]$, where
$c > 0$, and $J_e \succeq 0$. For graphs without isolated nodes, both definitions are equivalent.

Figure 7: Illustration of the subtree notation, $T_{i \to j}$ and $T_{i \setminus j}$.


Proposition 14 Diagonally dominant models are pairwise-normalizable (walk-summable). 


A constructive proof is given in Appendix A. The converse does not hold: not all pairwise- 
normalizable models are diagonally dominant. For instance, in our example of a 4-cycle with a 
chord shown in Figure 3(a), with r = .38 the model is not diagonally dominant but is walk-summable 
and hence pairwise-normalizable. 
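
To make the link between diagonal dominance and pairwise-normalizability concrete, here is a sketch of one possible edge-wise decomposition. It is not necessarily the construction used in Appendix A: it shares each node's slack $J_{ii} - \sum_{j \ne i} |J_{ij}|$ equally among the incident edges, which is a choice made here for illustration. The helper names and the sign placement in the example graph are assumptions.

```python
import numpy as np

def four_cycle_with_chord(r):
    """R for the 4-cycle with a chord; the negative edge placement is an assumption."""
    A = np.array([[0, -1, 1, 1],
                  [-1, 0, 1, 0],
                  [1, 1, 0, 1],
                  [1, 0, 1, 0]], dtype=float)
    return r * A

def pairwise_decomposition(J):
    """Split a strictly diagonally dominant symmetric J into 2x2 edge blocks.

    Returns {(i, j): J_e}. Each diagonal entry of a block exceeds |J_ij|,
    so every block is positive definite. Assumes no isolated nodes.
    """
    n = J.shape[0]
    off = np.abs(J - np.diag(np.diag(J)))
    degree = np.count_nonzero(off, axis=1)
    slack = np.diag(J) - off.sum(axis=1)
    if np.any(slack <= 0) or np.any(degree == 0):
        raise ValueError("J must be strictly diagonally dominant with no isolated nodes")
    blocks = {}
    for i in range(n):
        for j in range(i + 1, n):
            if J[i, j] != 0:
                a = off[i, j] + slack[i] / degree[i]
                b = off[i, j] + slack[j] / degree[j]
                blocks[(i, j)] = np.array([[a, J[i, j]], [J[i, j], b]])
    return blocks

def assemble(blocks, n):
    """Sum the zero-padded blocks [J_e] back into an n x n matrix."""
    J = np.zeros((n, n))
    for (i, j), J_e in blocks.items():
        J[np.ix_([i, j], [i, j])] += J_e
    return J

J = np.eye(4) - four_cycle_with_chord(0.3)      # 3r < 1, so diagonally dominant
blocks = pairwise_decomposition(J)
print(np.allclose(assemble(blocks, 4), J))
print(all(np.min(np.linalg.eigvalsh(J_e)) > 0 for J_e in blocks.values()))
```

For $r = 0.38$ this particular construction fails because the slack is negative at the degree-3 nodes, even though the model is still pairwise-normalizable by Proposition 13; a different choice of blocks is needed there.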


4. Walk-sum Interpretation of Belief Propagation 


In this section we use the concepts and machinery of walk-sums to analyze belief propagation. We 
begin with models on trees, for which, as we show, all valid models are walk-summable. Moreover, 
for these models we show that exact walk-sums over infinite sets of walks for means and variances 
can be computed efficiently in a recursive fashion. We show that these walk-sum computations map 
exactly to belief propagation updates. These results (and the computation tree interpretation of LBP 
recursions) then provide the foundation for our analysis of loopy belief propagation in Section 4.2. 


4.1 Walk-Sums and BP on Trees 


Our analysis of BP makes use of the following property: 


Proposition 15 (Trees are walk-summable) For tree structured models $J \succ 0 \Leftrightarrow \rho(\bar{R}) < 1$ (i.e., all
valid trees are walk-summable). Also, for trees $\rho(\bar{R}) = \rho(R) = \lambda_{\max}(R)$.


Proof. The proof is a special case of the proof of Corollary 3. Trees are non-frustrated (as there
are no cycles, let alone frustrated cycles) so they are walk-summable. Negating some variables
makes the model attractive and does not change the eigenvalues. $\square$
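
The spectral identities in Proposition 15 can be observed on a randomly generated tree. This is a small sketch; the tree generator, its parameters, and the function names are hypothetical and serve only as a test case.

```python
import numpy as np

def random_tree_R(n, scale, rng):
    """Random signed partial-correlation matrix supported on a random tree with n nodes."""
    R = np.zeros((n, n))
    for child in range(1, n):
        parent = rng.integers(0, child)          # attach each node to an earlier node
        R[child, parent] = R[parent, child] = scale * rng.uniform(-1.0, 1.0)
    return R

def tree_spectra(R):
    """Return (rho(R_bar), rho(R), lambda_max(R), lambda_min(J)) for J = I - R."""
    eigs = np.linalg.eigvalsh(R)                 # ascending order
    rho_R = float(np.max(np.abs(eigs)))
    rho_Rbar = float(np.max(np.abs(np.linalg.eigvalsh(np.abs(R)))))
    lam_min_J = float(np.min(np.linalg.eigvalsh(np.eye(R.shape[0]) - R)))
    return rho_Rbar, rho_R, float(eigs[-1]), lam_min_J

rng = np.random.default_rng(0)
for scale in (0.4, 0.8, 1.2):
    print(scale, tree_spectra(random_tree_R(8, scale, rng)))
```

On every tree generated this way the first three returned values agree, and the last one is positive exactly when they are below one.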





The proposition shows that walk-sums for means and variances are always defined on tree- 
structured models, and can be reordered in arbitrary ways without affecting convergence. We rely on 
this fact heavily in subsequent sections. The next two results identify walk-sum variance and mean 
computations with the BP update equations. The ingredients for these results are decompositions 
of the variance and mean walk-sums in terms of sums over walks on subtrees, together with the 
decomposition in terms of single-revisit and single-visit walks provided in Proposition 9. 
Walk-Sum Variance Calculation Let us look first at the computation of the variance at node $j$,
which is equal to the self-return walk-sum $\phi(j \to j)$. This can be computed directly from the single-revisit
walk-sum $\alpha_j = \phi(j \xrightarrow{\setminus j} j)$ as in Proposition 9. This latter walk-sum can be further decomposed
into sums over disjoint subsets of walks each of which corresponds to single-revisit self-return walks
that exit node $j$ via a specific one of its neighbors, say $i$. In particular, as illustrated in Figure 7, the
single-revisit self-return walks that do this correspond to walks that live in the subtree $T_{i \to j}$. Using
the notation $W(j \xrightarrow{\setminus j} j \mid T_{i \to j})$ for the set of all single-revisit walks which are restricted to stay in
subtree $T_{i \to j}$ we see that

$$\alpha_j = \sum_{i \in N(j)} \phi(j \xrightarrow{\setminus j} j \mid T_{i \to j}) \triangleq \sum_{i \in N(j)} \alpha_{i \to j}.$$

Moreover, every single-revisit self-return walk that lives in $T_{i \to j}$ must leave and return to node
$j$ through the single edge $(i, j)$, and between these first and last steps must execute a (possibly
multiple-revisit) self-return walk at node $i$ that is constrained not to pass through node $j$, that is, to
live in the subtree $T_{i \setminus j}$ indicated in Figure 7. Thus

$$\alpha_{i \to j} = \phi(j \xrightarrow{\setminus j} j \mid T_{i \to j}) = r_{ij}^2\, \phi(i \to i \mid T_{i \setminus j}) \triangleq r_{ij}^2\, \gamma_{i \setminus j}. \tag{14}$$

We next show that the walk-sums $\alpha_j$ and $\alpha_{i \to j}$ (hence variances $P_j$) can be efficiently calculated
by a walk-sum analog of belief propagation. We have the following result:


Proposition 16 Consider a valid tree model $J = I - R$. Then $\alpha_{i \to j} = -\Delta J_{i \to j}$ and $\gamma_{i \setminus j} = \hat{J}_{i \setminus j}^{-1}$, where
$\Delta J_{i \to j}$ and $\hat{J}_{i \setminus j}$ are the quantities defined in the Gaussian BP equations (7) and (8).


See Appendix A for the proof. 
Walk-Sum Mean Calculation We extend the above analysis to calculate means in trees. Mean $\mu_j$
is the reweighted walk-sum over walks that start anywhere and end at node $j$, $\mu_j = \phi_h(* \to j)$. Any
walk that ends at node $j$ can be expressed as a single-visit walk to node $j$ followed by a multiple-revisit
self-return walk from node $j$: $\phi_h(* \to j) = \bigl(h_j + \phi_h(* \xrightarrow{\setminus j} j)\bigr)\, \phi(j \to j)$, where the term $h_j$
corresponds to the length-zero walk that starts and ends at node $j$.

As we have done for the variances, the single-visit walks to node $j$ can be partitioned into the
single-visit walks that reach node $j$ from each of its neighbors, say node $i$, and thus, prior to this last
step across the edge $(i, j)$, reside in the subtree $T_{i \setminus j}$, so that

$$\beta_{i \to j} \triangleq \phi_h(* \xrightarrow{\setminus j} j \mid T_{i \to j}) = r_{ij}\, \phi_h(* \to i \mid T_{i \setminus j}).$$


Proposition 17 Consider a valid tree model $J = I - R$. Then $\beta_{i \to j} = \Delta h_{i \to j}$, where $\Delta h_{i \to j}$ is the
quantity defined in the Gaussian BP equation (8).


The proof appears in Appendix A. 
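
The walk-sum decompositions above suggest a simple recursive computation on a tree. The sketch below is an illustration rather than the paper's algorithm: it obtains $\gamma_{i \setminus j}$ by applying Proposition 9 inside the subtree $T_{i \setminus j}$, that is, $\gamma_{i \setminus j} = 1 / (1 - \sum_{k \in N(i) \setminus j} \alpha_{k \to i})$, and the analogous expression for the reweighted sum. The example tree, its weights, and all identifiers are hypothetical.

```python
import numpy as np

def tree_walk_sums(R, h):
    """Exact variances and means on a tree via recursive walk-sums.

    alpha[(i, j)] and beta[(i, j)] hold alpha_{i->j} and beta_{i->j}, the
    single-revisit and single-visit walk-sums contributed to node j through
    its neighbor i.
    """
    n = R.shape[0]
    nbrs = [[k for k in range(n) if R[i, k] != 0] for i in range(n)]
    alpha, beta = {}, {}

    def message(i, j):
        if (i, j) not in alpha:
            others = [k for k in nbrs[i] if k != j]
            for k in others:
                message(k, i)
            a = sum(alpha[(k, i)] for k in others)
            b = sum(beta[(k, i)] for k in others)
            gamma = 1.0 / (1.0 - a)                      # phi(i -> i | T_{i\j})
            alpha[(i, j)] = R[i, j] ** 2 * gamma         # equation (14)
            beta[(i, j)] = R[i, j] * gamma * (h[i] + b)  # r_ij * phi_h(* -> i | T_{i\j})
        return alpha[(i, j)], beta[(i, j)]

    variances, means = np.zeros(n), np.zeros(n)
    for j in range(n):
        a = sum(message(i, j)[0] for i in nbrs[j])
        b = sum(message(i, j)[1] for i in nbrs[j])
        variances[j] = 1.0 / (1.0 - a)                   # Proposition 9
        means[j] = (h[j] + b) / (1.0 - a)
    return variances, means

# Hypothetical 5-node tree: edges (0,1), (1,2), (1,3), (3,4).
R = np.zeros((5, 5))
for (i, j), w in {(0, 1): 0.3, (1, 2): -0.3, (1, 3): 0.4, (3, 4): 0.2}.items():
    R[i, j] = R[j, i] = w
h = np.array([1.0, -1.0, 0.5, 2.0, -0.3])

variances, means = tree_walk_sums(R, h)
P = np.linalg.inv(np.eye(5) - R)
print(np.allclose(variances, np.diag(P)), np.allclose(means, P @ h))
```

Each directed edge is processed once thanks to the memoization in `alpha` and `beta`, so the cost is linear in the number of edges, in line with the efficiency of BP on trees.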
4.2 LBP in Walk-Summable Models


In this subsection we use the LBP computation tree to show that LBP includes all the walks for the 
means, but only a subset of the walks for the variances. This allows us to prove LBP convergence 
for all walk-summable models. In contrast, for non-walksummable models LBP may or may not 
converge (and in fact the variances may converge but the means may not). As we will see in Section 
5, this can be analyzed by examining walk-summability (and hence validity) of the computation 
tree, rather than walk-summability of the original model. 

As we have discussed, running LBP for some number of iterations yields identical calculations
at any particular node $i$ to the exact inference calculations on the corresponding computation tree
rooted at node $i$. We use the notation $T_i^{(n)}$ for the $n$th computation tree at node $i$, $T_i$ for the full
computation tree (as $n \to \infty$) and we assign the label 0 to the root node. Then, $P_0(T_i^{(n)})$ denotes
the variance at the root node of the $n$th computation tree rooted at node $i$ in $G$. The LBP variance
estimate at node $i$ after $n$ steps is equal to

$$\hat{P}_i^{(n)} = P_0(T_i^{(n)}) = \phi(0 \to 0 \mid T_i^{(n)}).$$

Similarly, the LBP estimate of the mean $\mu_i$ after $n$ steps of LBP is

$$\hat{\mu}_i^{(n)} = \mu_0(T_i^{(n)}) = \phi_h(* \to 0 \mid T_i^{(n)}).$$

As we have mentioned, the definition of the computation trees $T_i^{(n)}$ depends upon the message
schedule $\{M^{(n)}\}$ of LBP, which specifies which subset of messages are updated at iteration $n$. We
say that a message schedule is proper if every message is updated infinitely often, that is, if for
every $m > 0$ and every directed edge $(i, j)$ in the graph there exists $n > m$ such that $(i, j) \in M^{(n)}$.
Clearly, the fully parallel form is proper since every message is updated at every iteration. Serial
forms which iteratively cycle through the directed edges of the graph are also proper. All of our
convergence analysis in this section presumes a proper message schedule. We remark that as
walk-summability ensures convergence of walk-sums independent of the order of summation, it makes
the choice of a particular message schedule unimportant in our convergence analysis. The following
result is proven in Appendix A.


Lemma 18 (Walks in $G$ and in $T_i$) There is a one-to-one correspondence between finite-length walks
in $G$ that end at $i$, and walks in $T_i$ that end at the root node. In particular, for each such walk in $G$
there is a corresponding walk in $T_i^{(n)}$ for $n$ large enough.


Now, recall that to compute the mean $\mu_i$ we need to gather walk-sums over all walks that start
anywhere and end at $i$. We have just shown that LBP gathers all of these walks as the computation
tree grows to infinity. The story for the variances is different. The true variance $P_{ii}$ is a walk-sum
over all self-return walks that start and end at $i$ in $G$. However, walks in $G$ that start and end at $i$
may map to walks that start at the root node of $T_i^{(n)}$, but end at a replica of the root node instead of
the root. These walks are not captured by the LBP variance estimate.^15 The walks for the variance
estimate $P_0(T_i^{(n)})$ are self-return walks $W(0 \to 0 \mid T_i^{(n)})$ that start and end at the root node in the
computation tree. Consider Figure 2. The walk $(1,2,3,1)$ is a self-return walk in the original graph
$G$ but is not a self-return walk in the computation tree shown in Figure 2(d). LBP variances capture
only those self-return walks of the original graph $G$ that are also self-return walks in the computation
tree; for example, the walk $(1,3,2,3,4,3,1)$ is a self-return walk in both Figures 2(a) and (d). We
call such walks backtracking. Hence,


Lemma 19 (Self-return walks in $G$ and in $T_i$) The LBP variance estimate at each node is a sum
over the backtracking self-return walks in $G$, a subset of all self-return walks needed to calculate
the correct variance.

15. Recall that the computation tree is a representation of the computations seen at the root node of the tree, and it is only
the computation at this node (that is, at this replica of node $i$) that corresponds to the LBP computation at node $i$ in
$G$.


Note that back-tracking walks for the variances have positive weights, since each edge in the
walk is traversed an even number of times. With each LBP step the computation tree grows and new
back-tracking walks are included, hence variance estimates grow monotonically.^16

We have shown which walks LBP gathers based on the computation tree. The convergence 
of the corresponding walk-sums remains to be analyzed. In walk-summable models the answer is 
simple: 


Lemma 20 (Computation trees of WS models are WS) For a walk-summable model all its computation
trees $T_i^{(n)}$ (for all $n$ and $i$) are walk-summable and hence valid.


Intuitively, walks in the computation tree $T_i^{(n)}$ are subsets of the walks in $G$, and hence they converge.
This implies that the computation trees are walk-summable, and hence valid. This argument
can be made precise, but a shorter formal proof using monotonicity of the spectral radius (11)
appears in Appendix A. Next, we use these observations to show convergence of LBP for
walk-summable models.


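Proposition 21 (Convergence of LBP for walk-summable models) If a model on a graph $G$ is
walk-summable, then LBP is well-posed, the means converge to the true means and the LBP variances
converge to walk-sums over the backtracking self-return walks at each node.


Proof. Let $W(i \xrightarrow{BT} i)$ denote the back-tracking self-return walks at node $i$. By Lemmas 18 and
19, we have:

$$W(* \to i) = \cup_n W(* \to 0 \mid T_i^{(n)}),$$
$$W(i \xrightarrow{BT} i) = \cup_n W(0 \to 0 \mid T_i^{(n)}).$$

We note that the computation trees $T_i^{(n)}$ at node $i$ are nested, $T_i^{(n)} \subset T_i^{(n+1)}$ for all $n$. Hence,
$W(* \to 0 \mid T_i^{(n)}) \subset W(* \to 0 \mid T_i^{(n+1)})$ and $W(0 \to 0 \mid T_i^{(n)}) \subset W(0 \to 0 \mid T_i^{(n+1)})$. Then, by Lemma 7, we
obtain the result:

$$\mu_i = \phi_h(* \to i) = \lim_{n \to \infty} \phi_h(* \to 0 \mid T_i^{(n)}) = \lim_{n \to \infty} \hat{\mu}_i^{(n)},$$
$$P_i^{(BT)} \triangleq \phi(i \xrightarrow{BT} i) = \lim_{n \to \infty} \phi(0 \to 0 \mid T_i^{(n)}) = \lim_{n \to \infty} \hat{P}_i^{(n)}. \qquad \square$$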





16. Monotonically increasing variance estimates is a characteristic of the particular initialization of LBP that we use, 
that is, the potential decomposition (4) together with uninformative initial messages. If one instead uses a pairwise- 
normalized potential decomposition, the variances are then monotonically decreasing. 
Figure 8: (a) LBP variances vs. iteration. (b) $\rho(R_n)$ vs. iteration.


Corollary 22 LBP converges for attractive, non-frustrated, and diagonally dominant models. In 
attractive and non-frustrated models LBP variance estimates are less than or equal to the true 
variances (the missing non-backtracking walks all have positive weights). 


In Weiss and Freeman (2001) Gaussian LBP is analyzed for pairwise-normalizable models. 
They show convergence for the case of diagonally dominant models, and correctness of the means 
in case of convergence. The class of walk-summable models is strictly larger than the class of diag- 
onally dominant models, so our sufficient condition is stronger. They also show that LBP variances 
omit some terms needed for the correct variances. These terms correspond to correlations between 
the root and its replicas in the computation tree. In our framework, each such correlation is a
walk-sum over the subset of non-backtracking self-return walks in $G$ that, in the computation tree, begin
at a particular replica of the root.


Example 2. Consider the model in Figure 3(a). We summarize various critical points for this
model in Figure 9. For $0 < r < .39039$ the model is walk-summable and LBP converges; then for
a small interval $.39039 < r < .39865$ the model is not walk-summable but LBP still converges, and
for larger $r$ LBP does not converge. We apply LBP to this model with $r = 0.39$, $0.395$ and $0.4$, and
plot the LBP variance estimates for node 1 vs. the iteration number in Figure 8(a). LBP converges
in the walk-summable case for $r = .39$, with $\rho(\bar{R}) \approx .9990$. It also converges for $r = 0.395$ with
$\rho(\bar{R}) \approx 1.0118$, but soon fails to converge as we increase $r$ to $0.4$ with $\rho(\bar{R}) \approx 1.0246$.

Also, for $r = .4$, we note that $\rho(R) = .8 < 1$ and the series $\sum_l R^l$ converges (but $\sum_l \bar{R}^l$ does not)
and LBP does not converge. Hence, $\rho(R) < 1$ is not sufficient for LBP convergence, showing the
importance of the stricter walk-summability condition $\rho(\bar{R}) < 1$.
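
For readers who want to experiment with these statements, the following is a minimal sketch of parallel Gaussian LBP in information form. It uses the standard Gaussian BP message updates with all messages initialized to zero, which we assume correspond to equations (7)-(9) of the paper (those equations are not reproduced in this section). The example graph uses an assumed sign placement and a value $r = 0.3$ well inside the walk-summable range so that convergence is fast; `h` is hypothetical.

```python
import numpy as np

def gaussian_lbp(J, h, n_iters):
    """Run n_iters parallel Gaussian LBP sweeps; return (means, variances).

    dJ[(i, j)] and dh[(i, j)] are the messages from node i to node j.
    Raises ValueError if an intermediate information parameter is not positive,
    which signals an invalid computation tree.
    """
    n = J.shape[0]
    nbrs = [[k for k in range(n) if k != i and J[i, k] != 0] for i in range(n)]
    dJ = {(i, j): 0.0 for i in range(n) for j in nbrs[i]}
    dh = dict(dJ)
    for _ in range(n_iters):
        new_dJ, new_dh = {}, {}
        for (i, j) in dJ:
            J_hat = J[i, i] + sum(dJ[(k, i)] for k in nbrs[i] if k != j)
            h_hat = h[i] + sum(dh[(k, i)] for k in nbrs[i] if k != j)
            if J_hat <= 0:
                raise ValueError("invalid computation tree")
            new_dJ[(i, j)] = -J[j, i] * J[i, j] / J_hat
            new_dh[(i, j)] = -J[j, i] * h_hat / J_hat
        dJ, dh = new_dJ, new_dh
    J_node = np.array([J[i, i] + sum(dJ[(k, i)] for k in nbrs[i]) for i in range(n)])
    h_node = np.array([h[i] + sum(dh[(k, i)] for k in nbrs[i]) for i in range(n)])
    if np.any(J_node <= 0):
        raise ValueError("invalid computation tree")
    return h_node / J_node, 1.0 / J_node

A = np.array([[0, -1, 1, 1],
              [-1, 0, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
h = np.array([1.0, -0.5, 2.0, 0.3])

J = np.eye(4) - 0.3 * A                     # frustrated, walk-summable
means, variances = gaussian_lbp(J, h, 200)
print(np.max(np.abs(means - np.linalg.solve(J, h))))
print(variances, np.diag(np.linalg.inv(J)))
```

In this walk-summable example the means agree with $J^{-1} h$ up to numerical precision, while the variances differ from the true ones because only back-tracking self-return walks are collected.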


5. LBP in Non-Walksummable Models 


While the condition in Proposition 21 is necessary and sufficient for certain special classes of
models (for example, for trees and single cycles), it is only sufficient more generally, and, as
in Example 2, LBP may converge for some non-walksummable models. We extend our analysis
to develop a tighter condition for convergence of LBP variances based on a weaker form of
walk-summability defined with respect to the computation trees (instead of $G$). We have shown in
Proposition 15 that for trees walk-summability and validity are equivalent, and $\rho(\bar{R}) < 1 \Leftrightarrow \rho(R) < 1 \Leftrightarrow J \succ 0$.
Hence, our condition essentially corresponds to validity of the computation tree.

First, we note that when a model on G is valid (J is positive definite) but not walk-summable, 
then some finite computation trees may be invalid (indefinite). This turns out to be the primary 
reason why belief propagation can fail to converge. Walk-summability on the original graph implies 
walk-summability (and hence validity) on all of its computation trees. But if the model is not walk- 
summable, then its computation tree may or may not be valid. 

We characterize walk-summability of the computation trees as follows. Let $T_i^{(n)}$ be the $n$th
computation tree rooted at some node $i$. We define $R_i^{(n)} \triangleq I - J_i^{(n)}$ where $J_i^{(n)}$ is the normalized
information matrix for $T_i^{(n)}$ and $I$ is an identity matrix. The $n$th computation tree $T_i^{(n)}$ is
walk-summable (valid) if and only if $\rho(R_i^{(n)}) < 1$ due to the fact that $\rho(\bar{R}_i^{(n)}) = \rho(R_i^{(n)})$ for trees. We are
interested in the validity of all finite computation trees, so we consider the quantity $\lim_{n \to \infty} \rho(R_i^{(n)})$.
Lemma 23 guarantees the existence of this limit:


Lemma 23 The sequence $\{\rho(R_i^{(n)})\}$ is monotonically increasing and bounded above by $\rho(\bar{R})$. Thus,
$\lim_{n \to \infty} \rho(R_i^{(n)})$ exists, and is equal to $\sup_n \rho(R_i^{(n)})$.

In the proof we use k-fold graphs, which we introduce in Appendix B. The proof appears in Appendix A.
The limit in Lemma 23 is defined with respect to a particular root node and message
schedule. The next lemma shows that for connected graphs, as long as the message schedule is
proper, they do not matter.


Lemma 24 For connected graphs and with proper message schedule, $\rho_\infty \triangleq \lim_{n \to \infty} \rho(R_i^{(n)})$ is independent
of $i$. The limit does not change by using any other proper message schedule.


This independence results from the fact that for large $n$ the computation trees rooted at different
nodes overlap significantly. Technical details of the proof appear in Appendix A. Using this lemma
we suppress the dependence on the root node $i$ from the notation to simplify matters. The limit $\rho_\infty$
turns out to be critical for convergence of LBP variances:


Proposition 25 (LBP validity/variance convergence) (i) If $\rho_\infty < 1$, then all finite computation
trees are valid and the LBP variances converge to walk-sums over the back-tracking self-return
walks. (ii) If $\rho_\infty > 1$, then the computation tree eventually becomes invalid and LBP is ill-posed.


Proof. (i) Since $\rho_\infty = \lim_{n \to \infty} \rho(R^{(n)}) < 1$ and the sequence $\{\rho(R^{(n)})\}$ is monotonically increasing,
there exists $\delta > 0$ such that $\rho(R^{(n)}) \le 1 - \delta$ for all $n$. This implies that all the computation
trees $T^{(n)}$ are walk-summable and that variances monotonically increase (since weights of
back-tracking walks are positive, see the discussion after Lemma 19). We have that $\lambda_{\max}(R^{(n)}) \le 1 - \delta$,
so $\lambda_{\min}(J^{(n)}) \ge \delta$ and $\lambda_{\max}(P^{(n)}) \le \frac{1}{\delta}$. The maximum eigenvalue of a matrix is a bound on the
maximum entry of the matrix, so $(P^{(n)})_{ii} \le \lambda_{\max}(P^{(n)}) \le \frac{1}{\delta}$. The variances are monotonically increasing
and bounded above, hence they converge.

(ii) If $\lim_{n \to \infty} \rho(R^{(n)}) > 1$, then there exists an $m$ such that $\rho(R^{(n)}) > 1$ for all $n \ge m$. This means
that these computation trees $T^{(n)}$ are invalid, and that the variance estimates at some of the nodes
are negative. $\square$


As discussed in Section 2.2, the LBP computation tree is valid if and only if the information
parameters $\hat{J}_{i \setminus j}^{(n)}$ and $\hat{J}_i^{(n)}$ in (7), (9) computed during LBP iterations are strictly positive for all $n$.
Hence, it is easily detected if the LBP computation tree becomes invalid. In this case, continuing to
run LBP is not meaningful and will lead to division by zero (if the computation tree is singular) or
to negative variances (if it is not positive definite).

Recall that the limit $\rho_\infty$ is invariant to message order by Lemma 24. Hence, by Proposition 25,
convergence of LBP variances is likewise invariant to message order (except possibly when $\rho_\infty = 1$).
The limit $\rho_\infty$ is bounded above by $\rho(\bar{R})$, hence walk-summability in $G$ is a sufficient condition for
well-posedness of the computation tree: $\rho_\infty \le \rho(\bar{R}) < 1$. However, the bound is not tight in general
(except for trees and single cycles). This is related to the phenomenon that the limit of the spectral
radius of the finite computation trees can be less than the spectral radius of the infinite computation
tree (which has no leaf nodes). See He et al. (2000) for analysis of a related discrepancy.


Means in non-WS models For the case where $\rho_\infty < 1 < \rho(\bar{R})$, the walk-sums for LBP variances
converge absolutely (see proof of Proposition 25), but the walk-sums for the means do not. The
reason is that LBP only computes a subset of the self-return walks for the variances but captures
all the walks for the means. However, the series LBP computes for the means, corresponding to a
particular ordering of walks, may still converge.

It is well known (Rusmevichientong and Van Roy, 2001) that once variances converge, the
updates for the means follow a linear system. Consider (7) and (8) with $\hat{J}_{i \setminus j}$ fixed, then the LBP
messages for the means $\Delta h = (\Delta h_{i \to j} \mid \{i, j\} \in E)$ follow a linear system update. For the parallel
message schedule we can express this as:

$$\Delta h^{(n+1)} = L\, \Delta h^{(n)} + b \tag{15}$$


for some matrix L and some vector b. Convergence of this system depends on the spectral radius 
p(L). However, it is difficult to analyze p(L) since the matrix L depends on the converged values of 
the LBP variances. To improve convergence of the means, one can damp the message updates by 
modifying (8) as follows: 

Ant") = (1-0) Ah +a) with 0<a<1. (16) 
We have observed in experiments that for all the cases where variances converge we also obtain 
convergence of the means with enough damping of BP messages. We have also tried damping 
the updates for the AJ messages, but whether or not variances converge appears to be independent 
of damping. Apparently, it is the validity of the computation tree (Pp. < 1) that is essential for 
convergence of both means and variances in damped versions of Gaussian LBP. 
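The effect of damping on a linear update can be illustrated with a toy system. The sketch below assumes a hypothetical $2 \times 2$ matrix `L` (not the LBP matrix from the paper) whose eigenvalues $0.8 \pm 0.7i$ lie outside the unit circle but have real part below 1. Damping replaces $L$ by $(1-\alpha)I + \alpha L$, which maps each eigenvalue $\lambda$ to $(1-\alpha) + \alpha\lambda$ and leaves the fixed point unchanged.

```python
import numpy as np

def spectral_radius(A):
    """Largest eigenvalue modulus of a square matrix."""
    return np.max(np.abs(np.linalg.eigvals(A)))

# Hypothetical update matrix with eigenvalues 0.8 +/- 0.7i (modulus > 1, real part < 1).
L = np.array([[0.8, -0.7],
              [0.7,  0.8]])
b = np.array([1.0, 0.5])

def damped_iteration(L, b, alpha, n_steps=2000):
    """Iterate x <- (1 - alpha) x + alpha (L x + b) from x = 0."""
    x = np.zeros_like(b)
    for _ in range(n_steps):
        x = (1.0 - alpha) * x + alpha * (L @ x + b)
    return x

alpha = 0.3                                        # illustrative damping factor
L_damped = (1.0 - alpha) * np.eye(2) + alpha * L   # damped system matrix
x_fixed = np.linalg.solve(np.eye(2) - L, b)        # fixed point of x = L x + b
x_damped = damped_iteration(L, b, alpha)
```

In this toy case the undamped spectral radius is about 1.06, while the damped one is about 0.96, so the damped iteration converges to the same fixed point. How much damping is needed depends on where the eigenvalues of the actual $L$ lie.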

Example 3. We illustrate Proposition 25 on a simple example. Consider the 5-node cycle model from Figure 3(b). In Figure 8(b), for $\rho = .49$ we plot $\rho(R_n)$ vs. $n$ (lower curve) and observe that $\lim_{n\to\infty} \rho(R_n) \approx .98 < 1$, and LBP converges. For $\rho = .51$ (upper curve), the model defined on the 5-node cycle is still valid but $\lim_{n\to\infty} \rho(R_n) \approx 1.02 > 1$, so LBP is ill-posed and does not converge.
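The numbers in this example can be reproduced approximately under two assumptions that are not stated in this section: that every edge of the 5-cycle carries partial correlation $-\rho$, and that for a single cycle under the parallel schedule the depth-$n$ computation tree is a path with $2n+1$ nodes. The helper names are hypothetical.

```python
import numpy as np

def cycle_R(rho, n=5):
    """Partial correlation matrix of an n-cycle with every edge weight -rho (assumed)."""
    R = np.zeros((n, n))
    for k in range(n):
        R[k, (k + 1) % n] = R[(k + 1) % n, k] = -rho
    return R

def path_spectral_radius(rho, m):
    """Spectral radius of an m-node path with edge weights -rho."""
    R = np.zeros((m, m))
    for k in range(m - 1):
        R[k, k + 1] = R[k + 1, k] = -rho
    return np.max(np.abs(np.linalg.eigvalsh(R)))

results = {}
for rho in (0.49, 0.51):
    R = cycle_R(rho)
    results[rho] = {
        # validity of the model: J = I - R positive definite
        "valid": np.linalg.eigvalsh(np.eye(5) - R).min() > 0,
        # walk-summability threshold: rho(|R|) = 2 * rho for a cycle
        "rho_abs": np.max(np.abs(np.linalg.eigvalsh(np.abs(R)))),
        # spectral radius of the depth-100 computation tree (a 201-node path)
        "rho_tree": path_spectral_radius(rho, 201),
    }
```

Under these assumptions both models are valid, the tree radius stays below 1 for $\rho = .49$ and exceeds 1 for $\rho = .51$, and in both cases it approaches $2\rho$ from below.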

As we mentioned, in non-walksummable models the series that LBP computes for the means is not absolutely convergent and may diverge even when variances converge. For our 4-cycle with a chord example in Figure 3(a), the region where variances converge but means diverge is very narrow, $r \approx .39865$ to $r \approx .39867$ (we use the parallel message schedule here; the critical point for the means is slightly higher using a serial schedule). In Figure 10(a) we show mean estimates vs. the iteration number on both sides of the LBP mean critical point for $r = 0.39864$ and for $r = 0.39866$. In the first case the means converge, while in the latter they slowly but very definitely diverge. The spectral radius of the linear system for mean updates in (15) for the two cases is $\rho(L) = 0.99717 < 1$ and $\rho(L) = 1.00157 > 1$ respectively. In the divergent example, all the eigenvalues of $L$ have real components less than 1 (the maximum such real component is $0.8063 < 1$). Thus by damping we can force all the eigenvalues of $L$ to enter the unit circle: the damped system matrix is $(1-\alpha)I + \alpha L$. Using $\alpha = 0.9$ in (16) the means converge.

Figure 9: Critical regions for example models from Figure 3. (a) 4-cycle with a chord. (b) 5-cycle.

Figure 10: The 4-cycle with a chord example. (a) Convergence and divergence of the means near the LBP mean critical point. (b) Variance near the LBP variance critical point: (top) number of iterations for variances to converge, (bottom) true variance, LBP estimate and the error at node 1.

In Figure 10(b) we illustrate that near the LBP variance critical point, the LBP estimates become more difficult to obtain and their quality deteriorates dramatically. We consider the graph in Figure 3(a) again as $r$ approaches 0.39867, the critical point for the convergence of the variances. The picture shows that the number of iterations as well as the error in LBP variance estimates explode near the critical point. In the figure we show the variance at node 1, but similar behavior occurs at every node. In Figure 9, we summarize the critical points of both models from Figure 3.

Figure 11: Venn diagram summarizing various subclasses of Gaussian models.


6. Conclusion 


We have presented a walk-sum interpretation of inference in Gaussian graphical models, which 
holds for a wide class of models that we call walk-summable. We have shown that walk-summability 
encompasses many classes of models which are considered “easy” for inference—trees, attractive, 
non-frustrated and diagonally dominant models—but also includes many models outside of these 
classes. A Venn diagram summarizing relations between these sets appears in Figure 11. We have 
also shown the equivalence of walk-summability to pairwise-normalizability. 

We have established that in walk-summable models LBP is guaranteed to converge, for both 
means and variances, and that upon convergence the means are correct, whereas the variances only 
capture walk-sums over back-tracking walks. We have also used the walk-summability of valid (i.e., 
positive definite) models on trees to develop a more complete picture of LBP for non-walksummable 
models, relating variance convergence to validity of the LBP computation tree. 

There are a variety of directions in which these results can be extended. One involves developing 
improved walk-sum algorithms that gather more walks than LBP does, to yield better variance 
estimates. Results along these lines—involving vectors of variables at each node as well as factor 
graph versions of LBP that group larger sets of variables—will be presented in a future publication. 
Another direction is to apply walk-sum analysis to other algorithms for Gaussian inference; for example, Chandrasekaran et al. are applying walk-sums to better understand the embedded trees algorithm (Sudderth et al., 2004).

Our current work is limited to Gaussian models, as walk-sums arise from the power series expansion for the matrix inverse. However, related expansions of correlations in terms of walks have been investigated for other models. Fisher (1967) developed an approximation to the pairwise correlations in Ising models based on self-avoiding walks. Brydges et al. (1983) use walk-sums for non-Gaussian classical and quantum spin-systems, where the weights of walks involve complicated multi-dimensional integrals. It would be very useful to develop ways to compute or approximate self-avoiding or non-Gaussian walk-sums efficiently and extend the walk-sum perspective to inference in a broader class of models.
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Appendix A. Detailed Proofs 


Proof of Proposition 1. Proof of (i) $\Rightarrow$ (ii). We examine convergence of the matrix series in (ii) element-wise. First note that $(\bar{R}^l)_{ij}$ is an absolute walk-sum over all walks of length $l$ from $i$ to $j$:

\[ (\bar{R}^l)_{ij} = \sum_{w: i \xrightarrow{l} j} |\phi(w)| \]

(there are a finite number of these walks so the sum is well-defined). Now, if (i) holds then using properties of absolute convergence we can order the sum $\sum_{w: i \to j} |\phi(w)|$ however we wish and it still converges. If we order walks by their length and then group terms for walks of equal lengths (each group has a finite number of terms) we obtain:

\[ \sum_{w: i \to j} |\phi(w)| = \sum_{l} \sum_{w: i \xrightarrow{l} j} |\phi(w)| = \sum_{l} (\bar{R}^l)_{ij}. \tag{17} \]

Therefore, the series $\sum_l (\bar{R}^l)_{ij}$ converges for all $i, j$.

Proof of (ii) $\Rightarrow$ (i). To show convergence of the sum $\sum_{w: i \to j} |\phi(w)|$ it is sufficient to test convergence for any convenient ordering of the walks. As shown in (17), $\sum_l (\bar{R}^l)_{ij}$ corresponds to one particular ordering of the walks which converges by (ii). Therefore, the walk-sums in (i) converge absolutely.

Proof of (ii) $\Leftrightarrow$ (iii). This is a standard result in matrix analysis (Varga, 2000).

Proof of (iii) $\Leftrightarrow$ (iv). Note that $\lambda$ is an eigenvalue of $\bar{R}$ if and only if $1 - \lambda$ is an eigenvalue of $I - \bar{R}$ ($\bar{R}x = \lambda x \Leftrightarrow (I - \bar{R})x = (1 - \lambda)x$). Therefore, $\lambda_{\min}(I - \bar{R}) = 1 - \lambda_{\max}(\bar{R})$. According to the Perron-Frobenius theorem, $\rho(\bar{R}) = \lambda_{\max}(\bar{R})$ because $\bar{R}$ is non-negative. Thus, $\rho(\bar{R}) = 1 - \lambda_{\min}(I - \bar{R})$ and we have that $\rho(\bar{R}) < 1 \Leftrightarrow \lambda_{\min}(I - \bar{R}) > 0$. $\square$
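The equivalences can be checked numerically on a small model. The sketch below assumes a hypothetical 3-node matrix `R` with mixed signs; it enumerates walks by brute force to confirm that $(\bar{R}^l)_{ij}$ is the absolute walk-sum over length-$l$ walks, and then compares the truncated series $\sum_l \bar{R}^l$ with $(I - \bar{R})^{-1}$.

```python
import itertools
import numpy as np

# Hypothetical partial correlation matrix (zero diagonal, mixed signs).
R = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.4],
              [-0.2, 0.4, 0.0]])
R_abs = np.abs(R)

def abs_walk_sum(R, i, j, length):
    """Sum of |phi(w)| over all walks of the given length (>= 1) from i to j."""
    n = R.shape[0]
    total = 0.0
    for middle in itertools.product(range(n), repeat=length - 1):
        walk = (i,) + middle + (j,)
        weight = 1.0
        for a, b in zip(walk, walk[1:]):
            weight *= R[a, b]          # steps along non-edges contribute zero
        total += abs(weight)
    return total

rho_abs = np.max(np.abs(np.linalg.eigvals(R_abs)))                 # condition (iii)
series = sum(np.linalg.matrix_power(R_abs, l) for l in range(200)) # condition (ii), truncated
limit = np.linalg.inv(np.eye(3) - R_abs)
lam_min = np.linalg.eigvalsh(np.eye(3) - R_abs).min()              # condition (iv)
```

For this matrix the row sums of $\bar{R}$ are at most 0.7, so $\rho(\bar{R}) < 1$ and all four conditions hold together.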





Proof of Corollary 3. We will show that for any non-frustrated model there exists a diagonal $D$ with $D_{ii} = \pm 1$, that is, a signature matrix, such that $DRD = \bar{R}$. Hence, $R$ and $\bar{R}$ have the same eigenvalues, because $DRD = DRD^{-1}$ is a similarity transform which preserves the eigenvalues of a matrix. It follows that $I - R \succ 0$ implies $I - \bar{R} \succ 0$ and walk-summability of $J$ by Proposition 1(iv).

Now we describe how to construct a signature similarity which makes $R$ attractive for non-frustrated models. We show how to split the vertices into two sets $V^+$ and $V^-$ such that negating $V^-$ makes the model attractive. Find a spanning tree $T$ of the graph $G$. Pick a node $i$. Assign it to $V^+$. For any other node $j$, there is a unique path to $i$ in $T$. If the product of edge weights along the path is positive, then assign $j$ to $V^+$, otherwise to $V^-$. Now, since the model is non-frustrated, all edges $\{j,k\}$ in $G$ such that $j,k \in V^+$ are positive, all edges with $j,k \in V^-$ are positive, and all edges with $j \in V^+$ and $k \in V^-$ are negative. This can be seen by constructing the cycle that goes from $j$ to $i$ to $k$ in $T$ and crosses the edge $\{k,j\}$ to close itself. If $j,k \in V^+$ then the paths $j$ to $i$ and $i$ to $k$ have a positive weight, hence in order for the cycle to have a positive weight, the last step $\{k,j\}$ must also have a positive weight. The other two cases are similar. Now let $D$ be diagonal with $D_{ii} = 1$ for $i \in V^+$, and $D_{ii} = -1$ for $i \in V^-$. Then

\[ DRD = \begin{pmatrix} R_{V^+} & -R_{V^+,V^-} \\ -R_{V^-,V^+} & R_{V^-} \end{pmatrix} \ge 0, \]

that is, $DRD = \bar{R}$. $\square$
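The spanning-tree construction of the signature matrix can be sketched as follows. This is a minimal illustration assuming a connected graph and using a breadth-first spanning tree rooted at node 0; the function name and both example matrices are hypothetical.

```python
from collections import deque
import numpy as np

def signature_matrix(R):
    """Diagonal D with entries +1/-1 from a BFS spanning tree (connected graph assumed)."""
    n = R.shape[0]
    sign = np.zeros(n)
    sign[0] = 1.0                       # the root is assigned to V+
    queue = deque([0])
    while queue:
        i = queue.popleft()
        for j in range(n):
            if R[i, j] != 0.0 and sign[j] == 0.0:
                # sign of the product of edge weights on the tree path root -> j
                sign[j] = sign[i] * np.sign(R[i, j])
                queue.append(j)
    return np.diag(sign)

def four_cycle(w01, w12, w23, w30):
    """Hypothetical 4-cycle with the given edge weights."""
    R = np.zeros((4, 4))
    for (i, j, w) in [(0, 1, w01), (1, 2, w12), (2, 3, w23), (3, 0, w30)]:
        R[i, j] = R[j, i] = w
    return R

R_ok = four_cycle(0.3, -0.2, 0.25, -0.4)    # two negative edges: non-frustrated
R_bad = four_cycle(0.3, -0.2, 0.25, 0.4)    # one negative edge: frustrated
D_ok = signature_matrix(R_ok)
D_bad = signature_matrix(R_bad)
```

For the non-frustrated cycle the result satisfies $DRD = \bar{R}$, so $R$ and $\bar{R}$ share their spectrum. For the frustrated cycle the same procedure leaves a negative entry on the one edge that is not in the spanning tree.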


Proof of Proposition 4. Proof of WS $\Rightarrow$ (i). WS is equivalent to $\rho(\bar{R}) < 1$ by Proposition 1. But $\rho(R) \le \rho(\bar{R})$ by (11). Hence, $\rho(\bar{R}) < 1 \Rightarrow \rho(R) < 1$.

Proof of (i) $\Rightarrow$ (ii). Given $J = I - R$, it holds that $\lambda_{\min}(J) = 1 - \lambda_{\max}(R)$. Also, $\lambda_{\max}(R) \le \rho(R)$. Hence, $\lambda_{\min}(J) = 1 - \lambda_{\max}(R) \ge 1 - \rho(R) > 0$ for $\rho(R) < 1$.

Proof of (i) $\Rightarrow$ (iii). This is a standard result in matrix analysis. $\square$





Proof of Proposition 10. Assume that $G$ is connected (otherwise we apply the proof to each connected component, and the spectral radii are the maxima over the respective connected components). We prove that $\rho(\hat{R}) = \rho(\bar{R})$. By the Perron-Frobenius theorem, there exists a positive vector $x$ such that $\bar{R}x = \rho(\bar{R})x$. Let $\hat{x} = (x; x)$. Then $\hat{R}\hat{x} = \rho(\bar{R})\hat{x}$ because

\[ (\hat{R}\hat{x})_{\pm} = (R_+ + R_-)x = \bar{R}x = \rho(\bar{R})x. \]

Hence, $\rho(\bar{R})$ is an eigenvalue of $\hat{R}$ with positive eigenvector $\hat{x}$. First suppose that $\hat{G}$ is connected. Then, by the Perron-Frobenius theorem, $\rho(\hat{R}) = \rho(\bar{R})$ because $\hat{R}$ has a unique positive eigenvector which has eigenvalue equal to $\rho(\hat{R})$. Now, $\hat{J} = I - \hat{R} \succ 0 \Leftrightarrow \hat{J}$ is WS $\Leftrightarrow \rho(\hat{R}) < 1 \Leftrightarrow \rho(\bar{R}) < 1 \Leftrightarrow J = I - R$ is WS. If $\hat{G}$ is disconnected then $\hat{R}$ is a block-diagonal matrix with two copies of $\bar{R}$ (after relabeling the nodes), so $\rho(\hat{R}) = \rho(\bar{R})$. $\square$





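Proof of Proposition 11. We partition walk-sums into sums over "even" and "odd" walks according to the number of negative edges crossed by the walk. Thus a walk $w$ is even if $\phi(w) > 0$ and is odd if $\phi(w) < 0$. The graph $\hat{G}$ is defined so that every walk from $i_+$ to $j_+$ is even and every walk from $i_+$ to $j_-$ is odd. Thus,

\[ \begin{aligned}
P_{ij} &= \sum_{\text{even } w: i \to j} \phi(w) + \sum_{\text{odd } w: i \to j} \phi(w) \\
&= \sum_{w: i_+ \to j_+} \hat{\phi}(w) - \sum_{w: i_+ \to j_-} \hat{\phi}(w) \\
&= \hat{P}_{i_+, j_+} - \hat{P}_{i_+, j_-}.
\end{aligned} \]

The second part of the proposition follows by similar logic. Now we classify a walk as even if $h_{w_0}\phi(w) > 0$ and as odd if $h_{w_0}\phi(w) < 0$. Note also that setting $\hat{h} = (h_+; h_-)$ has the effect that all walks with $h_{w_0} > 0$ begin in $V_+$ and all walks with $h_{w_0} < 0$ begin in $V_-$. Consequently, every even walk ends in $V_+$ and every odd walk ends in $V_-$. Thus,

\[ \begin{aligned}
\mu_i &= \sum_{\text{even } w: * \to i} h_* \phi(w) + \sum_{\text{odd } w: * \to i} h_* \phi(w) \\
&= \sum_{w: * \to i_+} \hat{h}_* \hat{\phi}(w) - \sum_{w: * \to i_-} \hat{h}_* \hat{\phi}(w) \\
&= \hat{\mu}_{i_+} - \hat{\mu}_{i_-}. \qquad \square
\end{aligned} \]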
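Both identities, together with the spectral radius equality from Proposition 10, can be checked numerically. The sketch assumes the doubled model has the block form $\hat{R} = \begin{pmatrix} R_+ & R_- \\ R_- & R_+ \end{pmatrix}$ with $R_+ = \max(R, 0)$ and $R_- = \max(-R, 0)$, which is the form suggested by the identity $(\hat{R}\hat{x})_\pm = (R_+ + R_-)x$; the matrix `R` and vector `h` are hypothetical.

```python
import numpy as np

# Hypothetical walk-summable model with mixed-sign edges and mixed-sign potentials.
R = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.4],
              [-0.2, 0.4, 0.0]])
h = np.array([1.0, -2.0, 0.5])
n = R.shape[0]

R_plus, R_minus = np.maximum(R, 0.0), np.maximum(-R, 0.0)    # R = R_plus - R_minus
R_hat = np.block([[R_plus, R_minus], [R_minus, R_plus]])     # assumed form of R-hat
h_hat = np.concatenate([np.maximum(h, 0.0), np.maximum(-h, 0.0)])  # (h_+; h_-)

P = np.linalg.inv(np.eye(n) - R)                 # covariance of the original model
P_hat = np.linalg.inv(np.eye(2 * n) - R_hat)     # covariance of the doubled model
mu = P @ h
mu_hat = P_hat @ h_hat

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))
```

Here `P_hat[:n, :n] - P_hat[:n, n:]` recovers `P`, and `mu_hat[:n] - mu_hat[n:]` recovers `mu`; all entries of `P_hat` are non-negative because every walk in the doubled model has positive weight.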


Proof of Proposition 12. Take $J_1$ and $J_2$ pairwise-normalizable. Take any $\alpha, \beta \ge 0$ such that at least one of them is positive. Then $\alpha J_1 + \beta J_2$ is also pairwise-normalizable simply by taking the same weighted combinations of each of the $J_e$ matrices for $J_1$ and $J_2$. Setting $\beta = 0$ shows that $\mathcal{J}_{PN}$ is a cone, and setting $\beta = 1 - \alpha$ shows convexity. The cone is pointed since it is a subset of the cone of semidefinite matrices, which is pointed. $\square$
Proof of Proposition 13. Proof of PN $\Rightarrow$ WS. It is evident that any $J$ matrix which is pairwise-normalizable is positive definite. Furthermore, reversing the sign of the partial correlation coefficient on edge $e$ simply negates the off-diagonal element of $J_e$, which does not change the value of $\det J_e$, so that we still have $J_e \succ 0$. Thus, we can make all the negative coefficients positive and the resulting model $I - \bar{R}$ is still pairwise-normalizable and hence positive definite. Then, by Proposition 1(iv), $J = I - R$ is walk-summable.

Proof of WS $\Rightarrow$ PN. Given a walk-summable model $J = I - R$ we construct a pairwise-normalized representation of the information matrix. We may assume the graph is connected (otherwise, we may apply the following construction for each connected component of the graph). Hence, by the Perron-Frobenius theorem there exists a positive eigenvector $x > 0$ of $\bar{R}$ such that $\bar{R}x = \lambda x$ and $\lambda = \rho(\bar{R}) > 0$. Given $(x, \lambda)$ we construct a representation $J = \sum_e [J_e]$ where for $e = \{i, j\}$ we set:

\[ J_e = \begin{pmatrix} \dfrac{|r_{ij}| x_j}{\lambda x_i} & -r_{ij} \\ -r_{ij} & \dfrac{|r_{ij}| x_i}{\lambda x_j} \end{pmatrix}. \]

This is well-defined (there is no division by zero) since $x$ and $\lambda$ are positive. First, we verify that $J = \sum_{e \in E} [J_e]$. It is evident that the off-diagonal elements of the edge matrices sum to $-R$. We check that the diagonal elements sum to one:

\[ \sum_e [J_e]_{ii} = \frac{1}{\lambda x_i} \sum_j |r_{ij}| x_j = \frac{(\bar{R}x)_i}{\lambda x_i} = \frac{\lambda x_i}{\lambda x_i} = 1. \]

Next, we verify that each $J_e$ is positive definite. This matrix has positive diagonal and determinant

\[ \det J_e = \left( \frac{|r_{ij}| x_j}{\lambda x_i} \right) \left( \frac{|r_{ij}| x_i}{\lambda x_j} \right) - r_{ij}^2 = \left( \frac{1}{\lambda^2} - 1 \right) r_{ij}^2 > 0. \]

The inequality follows from walk-summability because $0 < \lambda < 1$ and hence $\left( \frac{1}{\lambda^2} - 1 \right) > 0$. Thus, $J_e \succ 0$. $\square$
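The Perron-vector construction can be traced numerically. The sketch assumes a connected graph and a symmetric `R`, so that `numpy.linalg.eigh` returns the Perron pair of $\bar{R}$ as its last eigenpair; the function name and the example matrix are hypothetical.

```python
import numpy as np

def pairwise_normalized_edges(R):
    """Edge matrices J_e built from the Perron vector of |R| (connected graph assumed)."""
    evals, evecs = np.linalg.eigh(np.abs(R))
    lam = evals[-1]                    # Perron eigenvalue, equal to rho(|R|)
    x = np.abs(evecs[:, -1])           # Perron vector, sign fixed to be positive
    n = R.shape[0]
    edges = {}
    for i in range(n):
        for j in range(i + 1, n):
            if R[i, j] != 0.0:
                r = R[i, j]
                edges[(i, j)] = np.array(
                    [[abs(r) * x[j] / (lam * x[i]), -r],
                     [-r, abs(r) * x[i] / (lam * x[j])]])
    return edges, lam

# Hypothetical walk-summable model.
R = np.array([[0.0, 0.3, -0.2],
              [0.3, 0.0, 0.4],
              [-0.2, 0.4, 0.0]])
edges, lam = pairwise_normalized_edges(R)

# Reassemble J from the zero-padded edge matrices [J_e].
J_sum = np.zeros_like(R)
for (i, j), Je in edges.items():
    J_sum[i, i] += Je[0, 0]
    J_sum[j, j] += Je[1, 1]
    J_sum[i, j] += Je[0, 1]
    J_sum[j, i] += Je[1, 0]
```

The reassembled `J_sum` equals $I - R$, and each edge determinant equals $(1/\lambda^2 - 1) r_{ij}^2$.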








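Proof of Proposition 14. Let $a_i = J_{ii} - \sum_{j \ne i} |J_{ij}|$. Note that $a_i > 0$ follows from diagonal dominance. Let $\deg(i)$ denote the degree of node $i$ in $G$. Then, $J = \sum_{e \in E} [J_e]$ where for edge $e = \{i, j\}$ we set

\[ J_e = \begin{pmatrix} |J_{ij}| + \dfrac{a_i}{\deg(i)} & J_{ij} \\ J_{ij} & |J_{ij}| + \dfrac{a_j}{\deg(j)} \end{pmatrix} \]

with all other elements of $[J_e]$ set to zero. Note that:

\[ \sum_e [J_e]_{ii} = \sum_{j \in N(i)} \left( |J_{ij}| + \frac{a_i}{\deg(i)} \right) = a_i + \sum_{j \in N(i)} |J_{ij}| = J_{ii}. \]

Also, $J_e$ has positive diagonal elements and has determinant $\det(J_e) > 0$. Hence, $J_e \succ 0$. Thus, $J$ is pairwise-normalizable. $\square$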
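A small numerical check of this decomposition, assuming a diagonally dominant `J` in which every node has at least one neighbor (so that $\deg(i) > 0$); the function name and the example are hypothetical.

```python
import numpy as np

def dd_edge_matrices(J):
    """Edge matrices for a diagonally dominant J with no isolated nodes."""
    n = J.shape[0]
    off = np.abs(J) - np.diag(np.abs(np.diag(J)))   # |J_ij| with zero diagonal
    a = np.diag(J) - off.sum(axis=1)                # slack a_i, positive by dominance
    deg = (off > 0).sum(axis=1)                     # node degrees
    return {(i, j): np.array([[off[i, j] + a[i] / deg[i], J[i, j]],
                              [J[i, j], off[i, j] + a[j] / deg[j]]])
            for i in range(n) for j in range(i + 1, n) if off[i, j] > 0}

# Hypothetical diagonally dominant information matrix.
J = np.array([[1.0, 0.3, -0.2],
              [0.3, 1.0, 0.4],
              [-0.2, 0.4, 1.0]])
edges = dd_edge_matrices(J)

# Reassemble J from the zero-padded edge matrices [J_e].
J_sum = np.zeros_like(J)
for (i, j), Je in edges.items():
    J_sum[i, i] += Je[0, 0]
    J_sum[j, j] += Je[1, 1]
    J_sum[i, j] += Je[0, 1]
    J_sum[j, i] += Je[1, 0]
```

Each $2 \times 2$ block is positive definite because its diagonal entries strictly exceed $|J_{ij}|$.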
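Proof of Proposition 16. To calculate the walk-sum for multiple-revisit self-return walks in $T_{i\backslash j}$, we can use the single-revisit counterpart:

\[ \gamma_{i\backslash j} = \phi(i \to i \mid T_{i\backslash j}) = \frac{1}{1 - \phi(i \xrightarrow{\backslash i} i \mid T_{i\backslash j})}. \tag{18} \]

Now, we decompose the single-revisit walks in the subtree $T_{i\backslash j}$ in terms of the possible first step of the walk $(i,k)$, where $k \in N(i)\backslash j$. Hence,

\[ \phi(i \xrightarrow{\backslash i} i \mid T_{i\backslash j}) = \sum_{k \in N(i)\backslash j} \phi(i \xrightarrow{\backslash i} i \mid T_{k\to i}). \tag{19} \]

Using (14), (18), and (19), we are able to represent the walk-sum $\phi(j \xrightarrow{\backslash j} j \mid T_{i\to j})$ in $T_{i\to j}$ in terms of the walk-sums $\phi(i \xrightarrow{\backslash i} i \mid T_{k\to i})$ on smaller subtrees $T_{k\to i}$. This is the basis of the recursive calculation:

\[ \alpha_{i\to j} = r_{ij}^2 \, \frac{1}{1 - \sum_{k \in N(i)\backslash j} \alpha_{k\to i}}. \]

These equations look strikingly similar to the belief propagation updates. Combining (7) and (8) from Section 2.1 we have:

\[ -\Delta J_{i\to j} = J_{ij}^2 \, \frac{1}{J_{ii} + \sum_{k \in N(i)\backslash j} \Delta J_{k\to i}}. \]

It is evident that the recursive walk-sum equations can be mapped exactly to belief propagation updates. In normalized models $J_{ii} = 1$. We have the message update $\alpha_{i\to j} = -\Delta J_{i\to j}$, and the variance estimate in the subtree $T_{i\backslash j}$ is $\gamma_{i\backslash j} = \hat{J}_{i\backslash j}^{-1}$. $\square$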
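The recursion for $\alpha_{i\to j}$ can be run directly on a small tree. The sketch below assumes a normalized model $J = I - R$ on a hypothetical 5-node tree and takes the marginal variance at node $i$ to be $1/(1 - \sum_{k \in N(i)} \alpha_{k\to i})$, that is, the same expression with no neighbor excluded. Function names are illustrative.

```python
import numpy as np

def alpha(R, nbrs, i, j):
    """alpha_{i->j} = r_ij^2 / (1 - sum_{k in N(i)\\j} alpha_{k->i}), computed recursively."""
    incoming = sum(alpha(R, nbrs, k, i) for k in nbrs[i] if k != j)
    return R[i, j] ** 2 / (1.0 - incoming)

def tree_variances(R):
    """Marginal variances of J = I - R on a tree via the walk-sum recursion."""
    n = R.shape[0]
    nbrs = [[j for j in range(n) if R[i, j] != 0.0] for i in range(n)]
    return np.array([1.0 / (1.0 - sum(alpha(R, nbrs, k, i) for k in nbrs[i]))
                     for i in range(n)])

# Hypothetical tree: edges 0-1, 1-2, 1-3, 3-4 with mixed-sign weights.
R = np.zeros((5, 5))
for (i, j, w) in [(0, 1, 0.3), (1, 2, -0.4), (1, 3, 0.2), (3, 4, 0.5)]:
    R[i, j] = R[j, i] = w

variances = tree_variances(R)
exact = np.diag(np.linalg.inv(np.eye(5) - R))
```

The recursion terminates because each call excludes the edge it arrived on, so it only descends into a strictly smaller subtree.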





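Proof of Proposition 17. A multiple-revisit walk in $T_{i\backslash j}$ can be written in terms of single-visit walks:

\[ \phi_h(* \to i \mid T_{i\backslash j}) = \left( h_i + \phi_h(* \xrightarrow{\backslash i} i \mid T_{i\backslash j}) \right) \phi(i \to i \mid T_{i\backslash j}). \]

We already have $\gamma_{i\backslash j} = \phi(i \to i \mid T_{i\backslash j})$ from (18). The remaining term $\phi_h(* \xrightarrow{\backslash i} i \mid T_{i\backslash j})$ can be decomposed by the subtrees in which the walk lives:

\[ \phi_h(* \xrightarrow{\backslash i} i \mid T_{i\backslash j}) = \sum_{k \in N(i)\backslash j} \phi_h(* \xrightarrow{\backslash i} i \mid T_{k\to i}). \]

Thus we have the recursion:

\[ \beta_{i\to j} = r_{ij} \gamma_{i\backslash j} \Big( h_i + \sum_{k \in N(i)\backslash j} \beta_{k\to i} \Big). \]

To compare this to the Gaussian BP updates, let us combine (7) and (8) in Section 2.2:

\[ \Delta h_{i\to j} = -J_{ij} \hat{J}_{i\backslash j}^{-1} \Big( h_i + \sum_{k \in N(i)\backslash j} \Delta h_{k\to i} \Big). \]

Thus BP updates for the means can also be mapped exactly into recursive walk-sum updates via $\beta_{i\to j} = \Delta h_{i\to j}$. $\square$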
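The mean recursion can be checked the same way. The sketch assumes a normalized model on a hypothetical tree, uses $\alpha_{i\to j} = r_{ij}^2 \gamma_{i\backslash j}$ for the variance messages, and takes the mean at node $i$ to be $\gamma_i (h_i + \sum_{k \in N(i)} \beta_{k\to i})$ with $\gamma_i = 1/(1 - \sum_{k \in N(i)} \alpha_{k\to i})$. The function name is illustrative.

```python
from functools import lru_cache
import numpy as np

def tree_means(R, h):
    """Means of the model (J = I - R, h) on a tree via the walk-sum recursions."""
    n = R.shape[0]
    nbrs = [[j for j in range(n) if R[i, j] != 0.0] for i in range(n)]

    @lru_cache(maxsize=None)
    def gamma(i, j):        # gamma_{i\j}: self-return walk-sum at i in the subtree without j
        return 1.0 / (1.0 - sum(alpha(k, i) for k in nbrs[i] if k != j))

    @lru_cache(maxsize=None)
    def alpha(i, j):        # variance message alpha_{i->j}
        return R[i, j] ** 2 * gamma(i, j)

    @lru_cache(maxsize=None)
    def beta(i, j):         # mean message beta_{i->j}
        return R[i, j] * gamma(i, j) * (h[i] + sum(beta(k, i) for k in nbrs[i] if k != j))

    means = []
    for i in range(n):
        gamma_i = 1.0 / (1.0 - sum(alpha(k, i) for k in nbrs[i]))
        means.append(gamma_i * (h[i] + sum(beta(k, i) for k in nbrs[i])))
    return np.array(means)

# Hypothetical tree: edges 0-1, 1-2, 1-3, 3-4.
R = np.zeros((5, 5))
for (i, j, w) in [(0, 1, 0.3), (1, 2, -0.4), (1, 3, 0.2), (3, 4, 0.5)]:
    R[i, j] = R[j, i] = w
h = np.array([1.0, -2.0, 0.5, 0.0, 3.0])

means = tree_means(R, h)
exact = np.linalg.solve(np.eye(5) - R, h)
```

Memoizing the messages with `lru_cache` mirrors the way BP reuses each message instead of recomputing the subtree sums.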
Proof of Lemma 18. First, we note that for every walk $w$ which ends at the root node of $T_i^{(n)}$ there is a corresponding walk in $G$ which ends at $i$. The reason is that the neighbors of a given node $j$ in $T_i^{(n)}$ correspond to a subset of the neighbors of $j$ in $G$. Hence, for each step $(w_k, w_{k+1})$ of the walk in $T_i^{(n)}$ there is a corresponding step in $G$.

Next, we show that every walk $w = (w_0, \ldots, w_l)$ in $G$ is contained in $T_{w_l}^{(n)}$ for some $n$. First consider the parallel message schedule, for which the computation tree $T_{w_l}^{(n)}$ grows uniformly. Then for any walk in $G$ that ends at $w_l$ and has length $n$ there is a walk in $T_{w_l}^{(n)}$ that ends at the root.

The intuition for other message schedules is that every step $(i, j)$ of the walk will appear eventually in any proper message schedule $\mathcal{M}$. A formal proof is somewhat technical. First we unwrap the walk $w$ into a tree $T_w$ rooted at $w_l$ in the following way: start at $w_l$, the end of the walk, and traverse the walk in reverse. First add the edge $\{w_l, w_{l-1}\}$ to $T_w$. Now, suppose we are at node $w_k$ in $T_w$ and the next step in $w$ is $\{w_k, w_{k-1}\}$. If $w_{k-1}$ is already a neighbor of $w_k$ in $T_w$ then set the current node in $T_w$ to $w_{k-1}$. Otherwise create a new node $w_{k-1}$ and add the edge to $T_w$. It is clear that loops are never made in this procedure, so $T_w$ is a tree.
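The unwrapping procedure is easy to state as code. The sketch below stores the tree as two parallel lists, which is an illustrative choice: `labels[t]` is the graph node that tree node `t` replicates, and `parent[t]` is its parent in $T_w$ (with `None` for the root).

```python
def unwrap_walk(walk):
    """Unwrap a walk (w_0, ..., w_l) into a tree rooted at a replica of w_l.

    Returns (labels, parent) as parallel lists indexed by tree node id."""
    labels = [walk[-1]]          # tree node 0 is the root, a replica of w_l
    parent = [None]
    children = [{}]              # children[t]: graph label -> tree node id
    cur = 0
    for v in reversed(walk[:-1]):            # traverse the walk in reverse
        if parent[cur] is not None and labels[parent[cur]] == v:
            cur = parent[cur]                # v is already a neighbor: the parent
        elif v in children[cur]:
            cur = children[cur][v]           # v is already a neighbor: a child
        else:
            labels.append(v)                 # otherwise create a new replica of v
            parent.append(cur)
            children.append({})
            children[cur][v] = len(labels) - 1
            cur = len(labels) - 1
    return labels, parent

# A walk around a triangle unwraps into a path with repeated labels.
labels_a, parent_a = unwrap_walk((0, 1, 2, 0, 1))
# A back-tracking walk reuses the same tree edge.
labels_b, parent_b = unwrap_walk((0, 1, 0, 1))
```

Because a new tree node is only ever attached to the current node, no cycle can form, which is the observation the proof relies on.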

We now show for any proper message schedule $\mathcal{M}$ that $T_w$ is part of the computation tree $T_{w_l}^{(n)}$ for some $n$. Pick a leaf edge $\{i_1, j_1\}$ of $T_w$. Since $\{\mathcal{M}^{(n)}\}$ is proper, there exists $n_1$ such that $(i_1, j_1) \in \mathcal{M}^{(n_1)}$. Now $(i_1, j_1) \in T_{i_1 \to j_1}^{(n_1)}$ and the edge appears at the root of $T_{i_1 \to j_1}^{(n_1)}$. Also, $T_{i_1 \to j_1}^{(n_1)} \subset T_{i_1 \to j_1}^{(m)}$ for $m > n_1$, so this holds for all subsequent steps as well. Now remove $\{i_1, j_1\}$ from $T_w$ and pick another leaf edge $\{i_2, j_2\}$. Again, since $\{\mathcal{M}^{(n)}\}$ is proper, there exists $n_2 > n_1$ such that $(i_2, j_2) \in \mathcal{M}^{(n_2)}$. Remove $\{i_2, j_2\}$ from $T_w$, and continue similarly. At each such point $n_k$ of eliminating some new edge $\{i_k, j_k\}$ of $T_w$, the whole eliminated subtree of $T_w$ extending from $\{i_k, j_k\}$ has to belong to $T_{i_k \to j_k}^{(n_k)}$. Continue until just the root of $T_w$ remains at step $n$. Now the computation tree $T_{w_l}^{(n)}$ (which is created by splicing together $T_{i \to j}^{(n)}$ for all edges $(i, j)$ coming into the root of $T_w$) contains $T_w$, and hence it contains the walk $w$. $\square$





Proof of Lemma 20. This result comes as an immediate corollary of Proposition 28, which states that $\rho(\bar{R}_i^{(n)}) \le \rho(\bar{R})$ (here $R_i^{(n)}$ is the partial correlation matrix for $T_i^{(n)}$). For WS models, $\rho(\bar{R}) < 1$ and the result follows. $\square$





Proof of Lemma 23. The fact that the sequence $\{\rho(R_i^{(n)})\}$ is bounded by $\rho(\bar{R})$ is a nontrivial fact, proven in Appendix B using a K-fold graph construction. To prove monotonicity, note first that for trees $\rho(R_i^{(n)}) = \rho(\bar{R}_i^{(n)})$. Also, note that all of the variables in the computation tree $T_i^{(n)}$ are also present in $T_i^{(n+1)}$. We zero-pad $\bar{R}_i^{(n)}$ to make it the same size as $\bar{R}_i^{(n+1)}$ (this does not change the spectral radius). Then it holds that $\bar{R}_i^{(n)} \le \bar{R}_i^{(n+1)}$ element-wise. Using (11), it follows that $\rho(\bar{R}_i^{(n)}) \le \rho(\bar{R}_i^{(n+1)})$, establishing monotonicity. $\square$
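Monotonicity and boundedness can be observed numerically by building computation trees explicitly. The sketch assumes the parallel schedule, under which the depth-$n$ tree gives the root a copy of every neighbor and gives every other node a copy of every neighbor except its parent. The 4-cycle with a chord, its edge signs, and the weight 0.3 are hypothetical choices.

```python
import numpy as np

def computation_tree_R(R, root, depth):
    """Partial correlation matrix of the depth-`depth` computation tree (parallel schedule)."""
    n = R.shape[0]
    labels, parent, edges = [root], [None], []
    frontier = [0]
    for _ in range(depth):
        nxt = []
        for t in frontier:
            i = labels[t]
            p = labels[parent[t]] if parent[t] is not None else None
            for k in range(n):
                if R[i, k] != 0.0 and k != p:     # all neighbors except the parent
                    labels.append(k)
                    parent.append(t)
                    edges.append((t, len(labels) - 1, R[i, k]))
                    nxt.append(len(labels) - 1)
        frontier = nxt
    Rn = np.zeros((len(labels), len(labels)))
    for u, v, w in edges:
        Rn[u, v] = Rn[v, u] = w
    return Rn

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvalsh(A)))

# Hypothetical 4-cycle with a chord, one negative edge.
r = 0.3
R = np.zeros((4, 4))
for (i, j, w) in [(0, 1, r), (1, 2, -r), (2, 3, r), (3, 0, r), (0, 2, r)]:
    R[i, j] = R[j, i] = w

trees = [computation_tree_R(R, 0, d) for d in range(1, 7)]
radii = [spectral_radius(T) for T in trees]
bound = spectral_radius(np.abs(R))
```

The sequence `radii` is non-decreasing and stays below `bound`, and for each tree the spectral radius is unchanged by taking absolute values.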





Proof of Lemma 24. Let $T_i^{(n)}(\mathcal{M})$ denote the $n$th computation tree under a proper message schedule $\mathcal{M}$ rooted at node $i$. We use the following simple extension of Lemma 18: Let $T_i^{(n)}(\mathcal{M}_1)$ be the $n$th computation tree rooted at $i$ under message schedule $\mathcal{M}_1$. Take any node in $T_i^{(n)}(\mathcal{M}_1)$ which is a replica of node $j$ in $G$. Then there exists $m$ such that $T_i^{(n)}(\mathcal{M}_1) \subset T_j^{(m)}(\mathcal{M}_2)$, where $\mathcal{M}_2$ is another message schedule. The proof parallels that of Lemma 18: the tree $T_i^{(n)}(\mathcal{M}_1)$ has a finite number of edges, and we use induction adding one edge at a time.

Consider message schedule $\mathcal{M}_1$. By Lemma 23, $\rho_i \triangleq \lim_{n\to\infty} \rho(R_i^{(n)}(\mathcal{M}_1))$ exists. For any $\epsilon$ pick an $L$ such that for $n \ge L$ it holds that $|\rho(R_i^{(n)}(\mathcal{M}_1)) - \rho_i| < \frac{\epsilon}{2}$. Pick a replica of node $j$ inside $T_i^{(L)}(\mathcal{M}_1)$. Then using the property from the previous paragraph, there exists $M$ such that $T_i^{(L)}(\mathcal{M}_1) \subset T_j^{(M)}(\mathcal{M}_2)$. Similarly there exists $N$ such that $T_j^{(M)}(\mathcal{M}_2) \subset T_i^{(N)}(\mathcal{M}_1)$. It follows that $\bar{R}_i^{(L)}(\mathcal{M}_1) \le \bar{R}_j^{(M)}(\mathcal{M}_2) \le \bar{R}_i^{(N)}(\mathcal{M}_1)$, where we zero-pad the first two matrices to have the same size as the last one. Then, $\rho(R_i^{(L)}(\mathcal{M}_1)) \le \rho(R_j^{(M)}(\mathcal{M}_2)) \le \rho(R_i^{(N)}(\mathcal{M}_1))$. Then it holds that $\rho_i - \frac{\epsilon}{2} < \rho(R_j^{(M)}(\mathcal{M}_2)) < \rho_i + \frac{\epsilon}{2}$. Hence, $|\rho(R_j^{(M)}(\mathcal{M}_2)) - \rho_i| < \epsilon$, and $\lim_{n\to\infty} \rho(R_j^{(n)}(\mathcal{M}_2)) = \rho_i$. $\square$

Figure 12: Illustration of (a) graph G and (b) a 2-fold graph of G.





Appendix B. K-fold Graphs and Proof of Boundedness of $\rho(R_i^{(n)})$.

Consider an arbitrary graph $G = (V, E)$. Suppose that we have a pairwise MRF defined on $G$ with self potentials $\psi_i(x_i)$, for $v_i \in V$ and pairwise potentials $\psi_{ij}(x_i, x_j)$ for $(v_i, v_j) \in E$. We construct a family of K-fold graphs based on $G$ as follows:

1. Create $K$ disconnected copies $G_k$, $k \in \{1, \ldots, K\}$ of $G$, with nodes $v_i^k$ and edges $(v_i^k, v_j^k)$. The nodes and the edges of $G_k$ are labeled in the same way as the ones of $G$. The potentials $\psi_i$ and $\psi_{ij}$ are copied to the corresponding nodes and edges in all $G_k$.

2. Pick some pair of graphs $G_k$, $G_l$, and choose an edge $(v_i, v_j)$ in $G$. We flip the corresponding edges in $G_k$ and $G_l$, that is, edges $(v_i^k, v_j^k)$ and $(v_i^l, v_j^l)$ become $(v_i^k, v_j^l)$ and $(v_i^l, v_j^k)$. The pairwise potentials are adjusted accordingly.

3. Repeat step 2 an arbitrary number of times for a different pair of graphs $G_k$, or a different edge in $G$.


An illustration of the procedure appears in Figure 12. The original graph $G$ is a 4-cycle with a chord. We create a 2-fold graph based on $G$ by flipping the edges $(1,2)$ in $G_1$ and $(1',2')$ in $G_2$.

Now we apply the K-fold graph construction to Gaussian MRF models. Suppose that we have a model with information parameters $J$ and $h$ on $G$. Suppose that $J$ is normalized to have unit-diagonal. Let $G^K$ be a K-fold graph based on $G$ with the information matrix $J^K$ (which is also unit-diagonal by construction). Also, let $T_i^{(n)}$ be the $n$th computation tree for the original graph, and $J_i^{(n)}$ the corresponding information matrix (also unit-diagonal). Let $R = I - J$, $R^K = I^K - J^K$, and $R_i^{(n)} = I_i^{(n)} - J_i^{(n)}$ (here $I$, $I^K$, and $I_i^{(n)}$ are identity matrices of appropriate dimensions).
Lemma 26 (Spectral radii of $\bar{R}$ and $\bar{R}^K$) For any K-fold graph $G^K$ based on $G$: $\rho(\bar{R}^K) = \rho(\bar{R})$.

Proof. Suppose that $G$ is connected (otherwise apply the proof to each connected component of $G$, and the spectral radius for $G$ will be the maximum of the spectral radii for the connected components).

Then, by the Perron-Frobenius theorem there exists a vector $x > 0$ such that $\bar{R}x = \rho(\bar{R})x$. Create a K-fold vector $x^K$ by copying entry $x_i$ into each of the $K$ corresponding entries of $x^K$. Then $x^K$ is positive, and it also holds that $\bar{R}^K x^K = \rho(\bar{R}) x^K$ (since the local neighborhoods in $G$ and $G^K$ are the same). Now $\bar{R}^K$ is a non-negative matrix, and $x^K$ is a positive eigenvector, hence it achieves the spectral radius of $\bar{R}^K$ by the Perron-Frobenius theorem. Thus, $\rho(\bar{R}) = \rho(\bar{R}^K)$. $\square$
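The K-fold construction and Lemma 26 can be illustrated on the 2-fold example. The sketch below assumes each flip acts on an edge that is still unflipped in both chosen copies, which keeps the bookkeeping simple; the function name, the 0-based node indices, and the edge weights are hypothetical.

```python
import numpy as np

def k_fold(R, K, flips):
    """Build R^K from K copies of R.

    Each flip (k, l, i, j) swaps edge {i, j} between copies k and l. It is assumed
    that this edge has not been flipped before in either copy."""
    n = R.shape[0]
    RK = np.kron(np.eye(K), R)                   # K disconnected copies of G
    for (k, l, i, j) in flips:
        ki, kj = k * n + i, k * n + j
        li, lj = l * n + i, l * n + j
        w = R[i, j]
        for (u, v) in ((ki, kj), (li, lj)):      # remove the two within-copy edges
            RK[u, v] = RK[v, u] = 0.0
        for (u, v) in ((ki, lj), (li, kj)):      # add the two cross-copy edges
            RK[u, v] = RK[v, u] = w
    return RK

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvalsh(A)))

# Hypothetical 4-cycle with a chord.
r = 0.3
R = np.zeros((4, 4))
for (i, j, w) in [(0, 1, r), (1, 2, -r), (2, 3, r), (3, 0, r), (0, 2, r)]:
    R[i, j] = R[j, i] = w

# 2-fold graph obtained by flipping edge {0, 1} between the two copies.
RK = k_fold(R, 2, [(0, 1, 0, 1)])
```

Every node keeps the same multiset of incident edge weights after a flip, which is why the copied Perron vector remains an eigenvector of $\bar{R}^K$.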





The construction of a K-fold graph based on $G$ has parallels with the computation tree on $G$. The K-fold graph is locally equivalent to $G$ and the computation tree, except for its leaf nodes, is also locally equivalent to $G$. We show next that the computation tree $T_i^{(n)}$ is contained in some $G^K$ for $K$ large enough.

Lemma 27 (K-fold graphs and computation trees) Consider a computation tree $T_i^{(n)}$ corresponding to graph $G$. There exists a K-fold graph $G^K$, which contains $T_i^{(n)}$ as a subgraph, for $K$ large enough.

Proof. We provide a simple construction of a K-fold graph, making no attempt to minimize $K$. Let $T_i^{(n)} = (V_n, E_n)$. Each node $v' \in V_n$ corresponds to some node $v \in V$ in $G$. We create a K-fold graph $G^K$ by making a copy $G_{v'}$ of $G$ for every node $v' \in T_i^{(n)}$. Hence $K = |V_n|$. For each edge $(u', v') \in E_n$ in the computation tree, we make an edge flip between nodes in graphs $G_{u'}$ and $G_{v'}$ that correspond to $u$ and $v$ in $G$. This operation is well-defined because edges in $T_i^{(n)}$ that map to the same edge in $G$ do not meet. Thus, the procedure creates $G^K$ which contains $T_i^{(n)}$ as a subgraph. $\square$

Finally, we use the preceding lemmas to prove a bound on the spectral radii of the matrices $R_i^{(n)}$ for the computation tree $T_i^{(n)}$.

Proposition 28 (Bound on $\rho(R_i^{(n)})$) For computation tree $T_i^{(n)}$: $\rho(R_i^{(n)}) \le \rho(\bar{R})$.

Proof. Consider a computation tree $T_i^{(n)}$. Recall that $\rho(R_i^{(n)}) = \rho(\bar{R}_i^{(n)})$, since $T_i^{(n)}$ is a tree. Use Lemma 27 to construct a K-fold graph $G^K$ which has $T_i^{(n)}$ as a subgraph. Zero-padding $\bar{R}_i^{(n)}$ to have the same size as $\bar{R}^K$, it holds that $\bar{R}_i^{(n)} \le \bar{R}^K$. Since $\bar{R}_i^{(n)} \le \bar{R}^K$, using (11) and Lemma 26 we have:

\[ \rho(\bar{R}_i^{(n)}) \le \rho(\bar{R}^K) = \rho(\bar{R}). \qquad \square \]
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